Foreword
(1970 Edition) 

Widespread interest in the use of computers in automatic indexing created a demand 
for this publication that led to a recent exhaustion of all stock. While updating and revision 
would have been desirable, other demands have prescribed reissuance with additional material
added as appendices. These are a paper updating the field through September 1966
(Appendix B), and bibliographic citations, pertinent to the subjects in the original text, 
through August 1969 (Appendix C). 



Lewis M. Branscomb
Director 



Library of Congress Catalog Card Number: 65-60023
Foreword
(1965 Edition)

The Research Information Center and Advisory Service on Information Processing, 
(RICASIP), which is jointly supported by the National Science Foundation and the National 
Bureau of Standards, is engaged in a continuing program to collect information and maintain
current awareness about research and development activities in the field of information
processing and retrieval. An important responsibility of RICASIP is the preparation of
state-of-the-art reviews on topics of current interest in various areas of this broad field.

This report is one of a series intended as contributions toward improved interchange 
of information among those engaged in research and development in this field. The report 
considers new uses of machines and automatic data processing procedures for the compilation
and generation of indexes to the scientific and technical literature.

A. V. Astin, Director
AUTOMATIC INDEXING



A State-of-the-Art Report
Mary Elizabeth Stevens 



A state-of-the-art survey of automatic indexing systems 
and experiments has been conducted by the Research Informa- 
tion Center and Advisory Service on Information Processing, 
Information Technology Division, Institute for Applied Tech- 
nology, National Bureau of Standards. Consideration is first 
given to indexes compiled by or with the aid of machines, 
including citation indexes. Automatic derivative indexing is 
exemplified by key-word-in-context (KWIC) and other
word-in-context techniques. Advantages, disadvantages, and
possibilities for modification and improvement are discussed.

Experiments in automatic assignment indexing are summarized. 

Related research efforts in such areas as automatic classifi- 
cation and categorization, computer use of thesauri, statistical 
association techniques, and linguistic data processing are 
described. A major question is that of evaluation, particularly 
in view of evidence of human inter-indexer inconsistency. It
is concluded that indexes based on words extracted from text
are practical for many purposes today, and that automatic
assignment indexing and classification experiments show 
promise for future progress. 

1. INTRODUCTION 

This report of the Research Information Center and Advisory Service on Information 
Processing (RICASIP) 1/ is one of a series intended as contributions to improved
cooperation in the fields of information selection systems development, information
retrieval research and mechanized translation. In each of these areas, automatic
techniques for linguistic data processing are receiving increased attention. This report
covers a state-of-the-art survey of current progress in linguistic data processing as
related to the possibilities of automatic mechanized indexing. Insofar as has been 
practical, the survey of the literature on which this report is based has been made 
through February 1964. 

It has concentrated on the major developments in and related demonstrations of auto- 
matic indexing potentialities. Examples are also given of indexes compiled by machine 
and of potentially related research efforts in such areas as natural language text search- 
ing, statistical association techniques used for search and retrieval, and proposed 
systems for concept processing. There are, undoubtedly, various omissions. Neither 
the inclusion of reports on various specific experiments and techniques nor the omission 
of others is intended to reflect an endorsement as such of those that are included or an 
adverse evaluation of those that are not mentioned. 
1/ Initiated at the instigation of the National Science Foundation. RICASIP is jointly
supported by NSF and NBS.



1.1 Definitions and Background

The noun "index" has as its most general meaning "something used or serving to 
point out, a sign, token, or indication", (American College Dictionary) or "that which 
shows, indicates, manifests, or discloses; a token or indication" (Webster's International 
Dictionary, 2nd Edition, unabridged). More specifically, an index is "a pointer or key 
which directs the searcher to recorded information." 1/ The terms "index" and "indexing"
have been used in the fields of library science and documentation with reference to the fact
that the selection of information pertinent to a particular problem or interest, from all the
previously recorded information available, involves problems of decision-making based
on less than the full content or text of each of the records being searched. 

Short of complete scanning of all the possibly relevant material, it is necessary to
select or "distill" condensed representations or surrogates 2/ for each item. These
surrogates are intended to direct the searcher to the most probably pertinent items in a 
collection. The operations known as "indexing" thus involve: 

(1) Choosing clues that will serve to identify, for purposes of later retrieval, a 
particular book, document, or other recorded item, and

(2) Either marking on the item itself or recording as a separate item-surrogate
the tags, labels, or codes representing these clues.

The second of these two steps can be purely clerical in nature, but the first has been, 
to date, primarily the result of human intellectual efforts in subject content analysis. 

Well-known inadequacies of human indexing operations include both those stemming 
from man himself and those which result from the volume and the character of the 
materials with which he deals. On the human side, there are fundamental questions of 
perception, comprehension and judgment, as well as those of inter-indexer and even
intra-indexer consistency. In addition, the indexer is asked to guess in advance what others
will ask for, understand, and find relevant on future search. He is even asked, in effect, 
to anticipate the language of future inquiries. Thus, a somewhat facetious definition of the 
noun "index" has a considerable sting of truth: "A system of analyzing information in 
which the method used to choose categories is carefully hidden from the user. An attempt 
to outguess the future." 3/

The nature of the material to be indexed, especially in the area of scientific information,
raises a number of crucial problems. The still increasing spate of production of
technical literature and reports poses not only the problems of sheer volume in terms of 
1/ Crane and Bernier, 1958 [144], p. 513.

(Note: Full citations of references are given in the bibliography by author and by
numerical order of the figures in brackets.)

2/ See, for example, R. E. Wyllys, 1962 [651], for discussion of the two-fold purposes
of condensed representations: to serve a search-tool function on the one hand and
a content-revealing one on the other.

3/ Vanby, 1963 [622], p. 143.
manpower requirements and time necessary to produce indexes, but also problems of glut
in terms of man-hours necessary for the individual scientist to maintain awareness of
what is going on in his field. There are major problems created by newly emerging fields 
of effort, new interdisciplinary areas of interest, and dynamically evolving terminology. 

Increasing specialization, on the other hand, brings out additional difficulties in finding
what has been done elsewhere that might be applicable to one's own work and in avoiding 
wasteful duplication of effort, with their own attendant problems of terminology. 

All these problems are aggravated by the increasingly critical urgency which should 
apply to making all useful information available to those who need it as promptly and as 
selectively as possible. Recognition of this urgency and of the inadequacies of present 
solutions has therefore prompted consideration of the feasibility of using machines to 
assist in the indexing process. 



The term "mechanized indexing" signifies the accomplishment of some or all of the
indexing operations by mechanized means. The term includes the use of machines to
prepare and compile indexes, and to sort, assemble, duplicate and interfile catalog cards
carrying index entries. In this report, however, we shall be concerned primarily with
the area of automatic indexing, that is, the use of machines to extract or assign index
terms without human intervention once programs or procedural rules have been established.
This term is chosen in preference to auto-indexing, as originally suggested by
Luhn (1961 [373]) for the reasons set forth by Bar-Hillel, 1/ and to machine indexing 2/
due to possible confusion with machine tool operations. Automatic indexing has been used
by such workers in the field as Gardin (1963 [209]), Kennedy (1962 [310]), Maron (1961
[395]), Swanson (1962 [584]), and Wyllys (1963 [653]).



For obvious reasons, we also subsume under this term any specifically "clerical"
(Fairthorne, 1956 [188], 1956 [189], 1961 [190]) and hence machinable operations that
can similarly be substituted for human intellectual effort. There is nothing that machines 
can do which people cannot do except for limitations of time, cost, or availability of 
appropriate resources. Thus, we shall consider "machine-like indexing by people" 
(O'Connor, 1961 [447]; Montgomery and Swanson, 1962 [421]) as falling properly within 
the scope of automatic indexing, especially in the sense of "... deciding in a mechanical 
way to which category (subject or field of knowledge) a given document belongs ... deciding
automatically what a given document is 'about'." 3/

The principle of indexing, that is, of using subject-content clues and item surrogates
as substitutes for searches based on perusal of the full contents, has a history of several
millennia. In ancient Sumeria and Babylon, clay tablets were sometimes covered with a
thin clay envelope or sheath that was inscribed with brief descriptions of the contents of
the tablet itself (Carlson, 1963 [101]; Hessel, 1955 [268]; Lalley, 1962 [343]; Olney,
1963 [458]; Schullian, 1960 [525]). The first known instance of an index list is
apparently that of Callimachus in the third century B.C., which was a guide to the
contents of some 130,000 papyrus rolls (Olney, 1963 [458]; Parsons, 1952 [469]).
1/ Bar-Hillel, 1962 [35], p. 417.

2/ Bohnert, 1962 [69]; Edmundson, 1959 [176]; and others.

3/ Maron, 1961 [395], p. 404.
Application of the indexing principle by use of clerical procedures that today can be
accomplished by machine was suggested a little more than a century ago. A British
librarian, Andreas Crestadoro, advocated the permutation of the words in titles in 1856,
claiming that thus the subject matter index would follow the author's own definition of the
contents of his book. He prepared such "concordances of titles" for several different
library collections. 1/
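
The permutation of title words described above can be illustrated with a short sketch. It is not a reconstruction of Crestadoro's procedure or of any later program; the stoplist, the function names, and the tokenizing rule are assumptions made for illustration. The 60-character line width is borrowed from the keyword-centered cards mentioned in Luhn's chronology quoted later in this report.

```python
import re

# Hypothetical stoplist: words treated as non-significant and never used as entries.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}


def kwic_entries(title, width=60):
    """Return one (keyword, context line) pair per significant word in a title.

    Each keyword starts at the centre column of a fixed-width line. Text that
    does not fit on either side is cut off, as on a fixed-length card.
    """
    half = width // 2
    entries = []
    for match in re.finditer(r"[A-Za-z0-9][A-Za-z0-9'-]*", title):
        word = match.group()
        if word.lower() in STOPWORDS:
            continue
        start = match.start()
        left = title[max(0, start - half):start]   # text before the keyword
        right = title[start:start + width - half]  # keyword and what follows
        line = left.rjust(half) + right.ljust(width - half)
        entries.append((word.lower(), line))
    return entries


def kwic_index(titles, width=60):
    """Build a permuted-title index: every entry of every title, sorted by keyword."""
    rows = []
    for title in titles:
        rows.extend(kwic_entries(title, width))
    return sorted(rows, key=lambda row: row[0])


if __name__ == "__main__":
    sample = ["Automatic Indexing of Technical Reports",
              "The Statistical Analysis of Titles"]
    for keyword, line in kwic_index(sample):
        print(line)
```

Each title contributes as many index lines as it has significant words, so the searcher can find it under any of them. The only human decision left is the content of the stoplist.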

Within a generation, punched card machines had been invented, but they were not to
be used for library and documentation purposes for some decades yet. Keppel,
writing in 1937 of his vision of the library 21 years in the future, says:

"When it comes to using the cards, I blush to think for how many years we watched
the so-called business machines juggle with payrolls and bank books before it
occurred to us that they might be adapted to dealing with library cards with equal
dexterity. Indexing has become an entirely new art. The modern index is no
longer bound up in the volume, but remains on cards, and the modern version of
the Hollerith machine will sort out and photograph anything the dial tells it ..." 3/

By 1945, Bush had prophesied Memex [93], and in the 1950 Windsor lectures
Ridenour referred to an RCA development, the so-called "electronic pencil", a proposed
reading aid for the blind intended to convert printed characters to a suitable coded form.
He went on to suggest:

"... We shall have to arrange for cataloguing to be done by machine, without
human interaction except in terms of setting up once for all the system on
which the cataloguing is performed. ... It is only a step from this device (the
electronic pencil) to the electronic catalogue, which will read text for itself,
recognize key symbols and phrases with which it has been provided, and construct
appropriate catalog entries for the text it reads." 4/

It has only been in the past decade or so, however, that there have been any serious 
efforts directed to the use of machines for automatic indexing. In the period 1957-1958, 
Luhn first presented and published several provocative papers dealing with such 
challenging possibilities as "auto-abstracting", "auto-encoding" and "auto-indexing"
(Luhn, 1957 [385]; 1958 [374]; 1959 [371]). Luhn's work on the permutation of significant
words in titles, abstracts, and complete text, the Keyword-in-Context or KWIC



1/ See Crestadoro, 1856 [146]; see also Farley, 1963 [192]; Linder, 1960 [362];
Metcalfe, 1957 [416]; and Ohlman, 1960 [451].

2/ See pp. 19-22 of this report.

3/ See Keppel, 1939 [316], p. 5.

4/ See Ridenour, 1951 [500], p.
system, also began about this time. 1/ Also in 1958, Baxendale published the results of
experiments in automatic indexing involving scanning of topic sentences, syntactical
deletion processes and automatic phrase selection (Baxendale, 1958 [41]).



With respect to the KWIC and permuted title techniques, several independent
approaches were being developed at about the same time as Luhn's. These concurrent
efforts were carried out at the Wright Air Development Center (Netherwood, 1958 [437]),
the Rocketdyne Division of North American Aviation (Carlsen, et al, 1958 [99]), and the
System Development Corporation (Citron, et al, 1958 [120]; Ohlman, 1960 [451]). 2/
Netherwood's permuted title index to a bibliography on logical machine design involves
manual simulation of a machineable method. Although the results were not published
until June 1958, the manuscript was submitted in November 1957. 3/ The Rocketdyne
permuted-title bibliography, on industrial control, is credited by both Henderson (1962
[263]) and Ohlman (1960 [451]) as the first to be produced on computers, the program
1/ In a private communication dated March 13, 1963, Luhn provided the following
chronology:

May 1957        Routine 1 Program for word isolation within 60 characters per card,
                written by H. C. Fallon.

1957-1958       Creation of concordances of various scientific papers in the form of
                cards, each card showing a keyword centrally located within 60 letters
                worth of the associated phrase. Experimentation with these cards to
                arrive at thesauri for special fields of interest or study. Idea of
                automatic indexing by means of significant or keywords in context
                conceived by H. P. Luhn.

May 1958        Keyword-in-Context Index for titles only initiated by H. P. Luhn and
                samples produced with Routine 1 Program.

June 1958       Start punching of titles for Keyword-in-Context Index for literature on
                Information Retrieval and Machine Translation. (Keypunching done by
                Miss Olive Ferguson.)

August 1958     Simplified version of Routine 1 written by H. C. Fallon for generating
                Keywords-in-Context Indexes and delivered to Service Bureau
                Corporation, New York City.

September 1958  First Edition of Bibliography and Keyword-in-Context Index on
                Information Retrieval and Machine Translation published by Service
                Bureau Corporation.

January 1959    Started writing program for improved version of Keyword-in-Context
                Index, including derived identification code, written by Jr. J. Havender.

June 1959       Second Edition of Bibliography and Keyword-in-Context Index on
                Information Retrieval and Machine Translation, published by Service
                Bureau Corporation, including derived identification codes.

2/ See also National Science Foundation's CR&D Report No. 3, [430], p. 39.

3/ Netherwood, 1958 [437], p. 155, footnote.
having been written by J. T. Madigan. 1/ At any rate, both this program and Luhn's
KWIC program at IBM were apparently written relatively early in 1958.

Citron et al (1958 [120]) in presenting results of the SDC work and Ohlman in his
chronological bibliography of permutation indexing (1960 [451]) cite as at least partial
predecessors the "rotated file" principles developed at the Chemical-Biological Coordination
Center (1954 [112]; Heumann and Dale, 1957 [270] and 1957 [271]; Wood, 1956
[649]). It should also be noted as a matter of historical background that a system for
machine manipulation and compilation of permuted title-and-term-index records has been
in productive operation since 1952. 2/ This earlier effort was not generally known to
other investigators and was apparently first reported in the open literature as late as 1961.



Notwithstanding such other efforts, it is conceded by almost all workers in the fields 
of automatic abstracting and indexing that the major credit for pioneering interest and 
impetus should be attributed to Luhn and Baxendale. Specific acknowledgements of their
"pioneering work" and "first steps" have been made by many investigators both in this
country and abroad--for example, Borko and Bernick, 3/ Hines, 4/ Mooers, 5/ Pevzner
and Styazhkin, 6/ and Wyllys. 7/ In particular, the Russian investigator Purto states:

"So far as we know H. P. Luhn was the first investigator to suggest the concept of a set
of significant words for the consideration of problems in automatic abstracting." 8/



Much of the early effort 1957-58, whether at IBM or elsewhere, was in fact spurred
on by the International Conference on Scientific Information (ICSI) held in Washington, D.C.,
in November, 1958. The printed text of both the Preprints [478] and the final
Proceedings [480, 481] was deliberately prepared, over the typographer's objections,
so that a double space followed each period ending a sentence, in order to facilitate
machine processing of this text. Thus the printers "... were faced with ... the
necessity to prepare the final volume of the Proceedings from these preprints, and to
arrange type composition amenable to computer analysis. The latter is an experiment.
With an eye to the distant future, the Program Committee wished to make available the
monotype punched tapes from the text for statistical studies with computers. We hope


1/ Carlsen, et al, "Information Control", 1958 [99], p. 20.

2/ Veilleux, 1962 [624], p. 81: "Consumer demand balanced against availability of
manpower and machine time were the factors which led to the establishment of the
permutation title word indexing project in 1952."

3/ Borko and Bernick, 1962 [77], p. 3.

4/ Hines, 1963 [273], p. 7.

5/ Mooers, 1963 [424], p. 4.

6/ Pevzner and Styazhkin, 1961 [472], p. 3.

7/ Wyllys, 1961 [650], pp. 6-7.

8/ Purto, 1962 [484], p. 2.
some work of this kind will be demonstrated during the Conference. This has caused some
compromises in typography. ..." 1/
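
The value of the double-space convention for machine processing can be seen in a minimal sketch. It assumes the text is available as a single character string with line breaks already removed; the function name is hypothetical, and nothing here reproduces the programs actually run on the ICSI tapes. When every sentence-ending period is followed by two spaces, sentence boundaries can be found mechanically, and periods in initials and abbreviations, which are followed by a single space, are not mistaken for them.

```python
import re


def split_sentences(text):
    """Split text at each period that is followed by two or more spaces.

    A period followed by a single space (as in "H. P. Luhn") is left alone.
    """
    parts = re.split(r"(?<=\.) {2,}", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    sample = "Dr. Smith read the paper.  It cites H. P. Luhn.  The index followed."
    for sentence in split_sentences(sample):
        print(sentence)
```

Without the convention, a program would need a list of abbreviations or some syntactic analysis to make the same decision.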

Several pioneering experiments in automatic indexing were applied to this ICSI 
material. One of these led to the preparation of a permuted keyword index based on 
titles, subtitles, section and table headings, figure captions, and selected sentences or 
phrases taken directly from the text (Citron, et al, 1958 [120]). It was prepared using
punched card equipment, and the resulting listings were distributed to the Conference
participants in November of 1958. Another set of experiments involved trial of the
"auto-abstracting" and "auto-encoding" techniques proposed by Luhn (1958 [379]). 2/ A
computer program potentially applicable to certain ancillary operations which might be
involved in automatic indexing was also demonstrated at the time of the ICSI sessions
(Stevens, 1959 [568]).

Much of the rapidly proliferating work in the field of automatic indexing since that 
time has been inspired directly or indirectly by the results of these experiments using 
the ICSI material. For example, Dowell and Marshall, discussing early efforts at the 
English Electric Company, state: "We first became interested in the possibilities of 
computer produced indexes through Luhn's work at IBM and the early examples of KWIC 
indexes which were distributed at the time of the Washington Conference. ..." (Dowell
and Marshall, 1962 [159]). 3/



1/ "Preprints of papers of the International Conference on Scientific Information,"
1958, [478], Preface. (The monotype tapes are in fact still held in the custody of
the Research Information Center and Advisory Service on Information Processing,
National Bureau of Standards, but difficulties to be discussed later in this report
discourage their use.)

2/ See also his "Automated intelligence systems" 1962 [372], note 11, p. 100:
"Papers for this conference were distributed to participants two months ahead for
study. By arrangement with the Columbia University Press the Monotype tapes used
in publishing these preprints were made available for experimentation. At the
conference exhibit, IBM researchers demonstrated the automatic transcription of
these Monotype tapes to magnetic tape via punched cards and thence the automatic
creation and printout of abstracts by means of electronic data processing equipment
at the Space Systems Center in Washington, D.C. All this was done without any
human intervention except for the handling of the input and output records. Also,
preprinted Auto-Abstracts of Papers of Area 5 of the Conference were made available
to participants at the beginning of the conference."

3/ See also R. A. Kennedy, 1962 [310], p. 181: "While automatic indexing in any
interpretative and analytical sense is therefore not yet a practical matter, a
simpler mode of machine indexing is coming into wide use ... primarily
stimulated by the publication in 1958 and 1959 of reports by Ohlman, Hart and
Citron and Luhn."
A somewhat premature attempt was made to establish a subscription service for
KWIC indexes for a number of journals, for initial distribution beginning January 1,
1959. 1/ Called PILOT (Permutation Indexed Literature Of Technology), the proposed
service was advertized as "a revolutionary new totally cross-referenced index ... and it
will be produced at the speed of light". Figure 1 is a reproduction of a part of the
brochure issued in 1958 by Permutation Indexing, Incorporated, Sol Grossman, President, 
Los Angeles. While, perhaps unfortunately, the number of subscription orders received 
was not adequate in terms of the ambitious coverage planned, work on permuted title 
indexing elsewhere did lead rapidly to the publication of such indexes on a production 
basis. 

As of February 1964, there are more than 40 examples of KWIC and other variations 
of permuted keyword indexing techniques in productive operation or available to the 
searcher. KWIC-type techniques have also been extended to special one-time index
compilations and other applications, as in "automated content analysis" of verbal protocols of
psychiatric interviews and group leadership training sessions (Ford, 1963 [198]; Hart and
Bach, 1959 [256]; Jaffe, 1962 [294] and 1958 [296]; Stone, et al, 1962 [575]).

The same period during which the ICSI was planned and held (1957-1958) was also 
marked by the first issue of Current Research and Development in Scientific Documentation
by the National Science Foundation. In it and in subsequent issues, there were
reported other early efforts in machine- compiled indexes, in the construction and use of 
special thesauri, and in indexing and retrieval experiments based on machine processing 
of text. Thus, for example, punched card methods for compiling printed indexes and 
announcement lists were under consideration at Bell Laboratories and at Esso Research 
and Engineering. Special attention was being given to thesauri as early as July 1957 at 
both Chemical Abstracts Service and the Cambridge Language Research Unit, and 
at Ramo Wooldridge, "Research on the problems of fully automatic indexing and retrieval
based on raw text input to a general-purpose computer is under way." 2/

Nevertheless, as of the present date, the question of the possibility of automatic 
indexing in the sense of the substitution of machineable procedures for human intellectual 
efforts normally required to identify, categorize, classify, index, select, and list 
particular items in a collection of items is still moot. Opinions run the gamut from 
extreme pessimism, "Mechanization of abstracting and indexing is rejected as impractical
for the foreseeable future" 3/ to enthusiastic optimism, "The conclusion that automatic
indexing and cataloging is superior to human indexing and cataloging is both provocative
and remarkable." 4/

Borko and Bernick claim that "... Raw data, i.e., unedited natural language text,
can be processed statistically so as to automatically assign index terms to each document
and to classify the document into a subject category; this has been demonstrated." 5/ On
the other hand, Farradane thinks that any form of mechanized processing in indexing



1/ See Linder, 1960 [363], p. 99 and Figure 1.

2/ National Science Foundation's CR&D Reports No. 1 [430], pp. 4, 6; No. 3 [430],
pp. 12, 19, 31.

3/ Bar-Hillel, 1958 [33], abstract.

4/ Swanson, 1962 [584], p. 468.

5/ Borko and Bernick, 1963 [78], p. 28.
operations is "liable to continuous error", 1/ while Baxendale takes a middle ground:
"Thus far the role of the computer is chiefly that of research instrument; whether or not
it can fully assume the task of indexing is still in doubt". 2/

1.2 Scope of This Study

In view of the continuing controversy over the feasibility and evaluation of automatic 
indexing techniques, a state-of-the-art survey and report is perhaps premature at this 
time. The topic is controversial on at least five grounds: First is the question, "Can 
indexing be done by machine at all?" Next, "Is what can be done by machine properly
termed 'abstracting', 'indexing', or 'classifying'?" The third moot point is "Is whatever 
can be done by machine good enough, acceptable, as good as, or better than the product 
of human operations?" The fourth and most critical question is "How can we evaluate 
acceptability or comparability for any indexing process whatsoever, whether carried out 
by man or by machine or by machine-aided manual operations?" Finally, "If an indexing 
product is to be achieved by machine, can it be done by statistical means alone, or must 
syntactic, semantic and pragmatic considerations be brought to bear in the machine 
decision-making processes?" 

The heat of controversy over any of these five grounds of debate is almost inversely 
related to the availability of objectively validated evidence to which appeal might be made. 
Thus, the literature on the topic to date is typically colored by personal reactions both
pro and con, and even the cynics rely more on subjective judgments and personal preferences
than on any substantial body of data. O'Connor cites typical claims of both proponents
and opponents of the feasibility of automatic indexing, and he comments on both,
"I have seen no good evidence offered in support of such a conclusion." 3/

An impartial middle ground is offered by recognition that "To define a process
ordinarily thought to require human intellectual effort in such a way that it can be performed
by a machine imposes a rigor and a discipline on the definition which itself is invaluable
to understanding the nature of the process". 4/ Learning more about the indexing
process itself, through experimentation with machines, will provide "results of general
interest, not just to those optimistic about machine indexing experiments". 5/ In this
sense, a state-of-the-art study is not premature. In this sense, therefore, we shall
explore the five questions listed above in subsequent sections of this report.



1/ Farradane, 1961 [193], p. 236.

2/ Baxendale, 1962 [42], p. 69.

3/ O'Connor, 1961 [447], pp. 274 and 275.

4/ Swanson, 1962 [583], p. 288.

5/ Bohnert, 1962 [69], p. 9.
More particularly, in this survey of automatic indexing efforts, we will be concerned
with the following principal topics: 

(1) A brief indication of the variety of ways in which punched card machines and 
computers can be and have been used in the preparation or compilation of 
indexes. 1/

(2) A more detailed consideration of the possibilities for machine generation of 
indexes, specifically including: 

(a) Automatic derivative indexing, as in various examples of machine 
extraction of keywords, where selection is based upon pre-specified
criteria, 

(b) Automatic assignment indexing, whereby the machine is programmed to 
determine, in accordance with various specified criteria, whether or 
not some one or more members of an established list of 'labels' (such
as subject headings, class names, descriptors, or other indexing terms) 
should appropriately be assigned to the document or item in question, and 

(c) Automatic classification techniques, on which such assignment-indexing
operations may or may not be based. 

(3) Consideration of the use of machines as relatively sophisticated aids to human 
intellectual operations applied in either subject-content analyses or
search-strategy determinations.

(4) Discussion of the question of evaluation of any index whatever, whether 
manually or mechanically prepared. 

(5) Consideration of the implications of related research and development efforts, 
specifically including: 

(a) Comparative evaluation of indexing systems, 

(b) Development and use of new types of "indexing" aids (in the sense of 
"pointing to" and "indicative of" the probable subject-content relevance) 
to either selective dissemination or retrospective search of the technical 
literature, 

(c) Linguistic and logical-inference approaches to the elucidation of 'meaning'
in natural-language messages, and

(d) Theoretical approaches to the problems of determining
"membership-in-classes".



1/ Note that card-controlled camera systems, such as the Listomatic, and Addressograph
machines have also been used for index compilations. See, for example, Shaw,
1951 [542], p. 49, who cites early use of the Addressograph for bibliographical work
by A. Predeek, "Die Adrema-Maschine als Organizationsmittel im Bibliotheksbetriebe",
Berlin, 1930, and E. Morel, "Les Machines au secours de la Bibliographie",
Revue du Livre 1:14-19 (1933). Use of such devices is not included in this
report, however, since they cannot be adapted to machine generation of indexes.



(6) Appraisal of the current prospects for further research and development. 

Certain difficulties of organization are evident. Thus many proposals precede actual
tests of techniques to which they are akin. Other proposals have been engendered as
by-products of or incidental to investigations of other techniques, such as those of text
processing to derive by machine selected sentences which together may serve as
automatically generated "abstracts", more properly extracts. 1/

This related subject of automatic abstracting, i.e., the application of machine-usable
rules to the extraction or generation of textual information representing in condensed
form that carried in the document as a whole, will not be of primary concern.
However, it will be noted that most of the automatic abstracting techniques so far
proposed are potentially usable as tools for automatic indexing, especially in the trivial
sense that the automatic selection of index terms could be based solely upon the
substantive words found in the machine-prepared extract. Further, since we are presuming
that a state-of-the-art review of automatic indexing techniques is in some sense
appropriate at this time, we shall emphasize the actual results of machine compilation and
machine generation of indexes and those investigations of assignment-indexing techniques
for which experimental or comparative data have been reported, rather than theoretical
approaches.



1/ See, for example, Luhn, 1959 [384], p. 4: "The principle of abstracting
information by extracting certain portions or elements from the full text of a
document is particularly suitable to mechanization"; Becker, 1960 [44], p. 13:
"Perhaps 'extracting' would have been a better word than 'abstracting'"; Edmundson
and Wyllys, 1961 [181], p. 227: "All proposed methods for making an automatic
abstract of a document involve using the author's own words by selecting complete
sentences, thereby reducing abstraction to the simple task of extraction."

2/ See Wyllys, 1963 [653], p. 22: "Automatic indexing is an area that seems
to us to be especially close to automatic abstracting, since the words and word
groups found to be most representative of a document for automatic abstracting
purposes are obvious candidates for entries in an automatic index for the
documents." See also Tanimoto, 1961 [594], p. 235: "Thus after extracting
k sentences which are a predetermined small fraction of the document,
we have an 'abstract'. To find the indexes to the document we take these k
sentences and the corresponding sets of the canonical elements and consider
terms versus sentences instead of sentences versus terms. . . The same analysis
is then applied to this 'transposed' problem to produce the index terms"; Yakushin,
1963 [654], p. 17: "If some method can be employed for the automatic compilation of
abstracts, it can as well be used for the subject index."

1.3 Derivative vs. Assignment Indexing



At least part of the provocation and controversy with respect to the possibilities for
the use of machines in indexing is due to confusion as to what type of indexing is meant.
This in turn relates to a much older and broader controversy--that between "word" or
"catchword" indexing on the one hand and "subject indexing", "concept indexing", or
"controlled indexing" on the other.

In terms of operational definition, the contrast is best expressed in Luhn's
distinction between index entries that are derived from the text of an item itself and those
that are assigned to it from a list or schedule of subject categories, descriptors and the
like, which exists independently of the text of the item (Luhn, 1962 [372]). 1/ In general,
the differentiations that are made for the broader controversy, and the claims and
counter-claims made by the enthusiasts of either school, provide background for the
distinctions that should be made between various automatic derivative indexing operations
and whatever possibilities may be demonstrated for assignment indexing by machine.

In his text on information storage and retrieval Kent (1962 [315]) contrasts word
indexing as used in permuted keyword indexes, concordances and "pure" Uniterm systems with
controlled indexing which "implies a careful selection of terminology used in indexes in
order to avoid, as far as possible, the scattering of related subjects under different
headings." He notes elsewhere that word indexing requires little subject-matter training
on the part of the indexer and little skill in indexing as such, and adds: "It is this type of
indexing that a machine can perform well." 2/

Like Kent, Bernier thinks that true subject or assignment indexing requires highly
trained human indexers. He says further:

"The difference between subject and word indexing has been unclear at times.
Both types employ words, but only true subject indexing employs them with
discrimination. Word indexing leads to omission of entries, scattering of
related information, and a flood of unnecessary entries. Word indexing uses
words as they are found, in the material indexed with a minimum regard for
standardized meaning. . . " 3/

Herner provides a further amplification of differences that are pertinent to
consideration of indexing by machine, as follows:



1/ See also Herner, 1962 [266], p. 5; Skaggs and Spangler, 1963 [557], p. 60; Slamecka,
1963 [558], p. 224. Mooers makes a similar distinction between "index terms which
are words or phrases extracted from the text and stylized conceptual terms--cliches--
which are assigned to the text", 1963 [423], p. 4.

2/ Kent, 1962 [314], p. 268.

3/ Bernier, 1956 [54], p. 23.

4/ Herner, 1963 [267], p. 183.



"The differentiation that is made between the two types of indexing is that word 
indexing is inextricably tied to the words in a text; If a word appears it gets 
indexed as such; if it does not appear it does not get indexed. Concept index- 
ing, on the other hand, has an element of abstraction in it; Words may either 
be indexed as such or may be converted, either by themselves or in combination 
with other words, into concepts which may not bear a direct resemblance to 
the words or combinations of words that evoked them in the indexer's mind. " 

Machine techniques such as those of Luhn's KWIC, like the early Uniterm systems,
look no farther than the words used by the one author himself. Techniques such as those
of Maron, Swanson, Borko, Meadow and Williams, among others, look specifically to
relationships between words as used by one author to patterns of word usages in a given
subject area or given document collection. They may also look to these patterns as in
turn related to prior human analytic judgments of the "aboutness" referents of items in
the collection. In this sense, they at least attempt replication by machine of assignment
indexing.

There is no real question but that machines can in fact derive words from text
provided that it is in machine-readable form. This machine procedure may involve direct
extraction of all words as index entries, as in a complete concordance. It may involve
the extraction of only those words which survive a "purging" operation in which articles,
conjunctions, adjectives, and other "common" words are first deleted. Various
machine-controlled modifications to such "derivative" indexing are also available. The case for
machine achievement of assignment indexing for any but limited special cases is not so
clear.
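
The "purging" operation just described can be illustrated with a minimal sketch in Python. The stop list, the function name, and the sample title are all hypothetical and are not drawn from any of the systems surveyed here; a working stop list would be far longer.

```python
import re

# Hypothetical stop list of "common" words; illustrative only.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "or", "the", "to"}

def derive_index_terms(text, stop_words=STOP_WORDS):
    """Return the distinct words of `text` that survive purging, alphabetized."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w not in stop_words})

title = "The Use of Computers in the Preparation of Indexes"
print(derive_index_terms(title))                    # purged, derivative index terms
print(derive_index_terms(title, stop_words=set()))  # every word, as in a complete concordance
```

Passing an empty stop list gives the "complete" case, in which every word becomes an entry. Nothing in the sketch attempts assignment indexing, since every term returned is taken directly from the text.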



2. INDEXES COMPILED BY MACHINE 

A first and obvious use of machines in indexing processes is in the manipulation of 
index entries, previously selected on the basis of human analysis, to produce various 
orderings, duplications and listings of these entries. The power of machine techniques 
to speed and economize the sorting, ordering and listing operations in the preparation 
or compilation of indexes was recognized quite early, both in the field of library science 
and in the consideration of potential areas of application by specialists in machine 
potentialities. 

In particular, two specialized types of index, at least in the broad sense, are such
that their compilation would be almost prohibitive in terms of time and cost were it not
for the use of machines. These are, respectively, the case of the complete index, the
index to all words of a text in their various contexts, which is a concordance, and the
case of the "citation index", which has been used in the field of law for many years but
has only quite recently been suggested for literature search purposes related to
scientific and technical information.



See, for example, Doyle, 1963 [162], p. 11: "Without data-processing
machinery, concordances are prohibitively expensive to generate for most uses
except in those cases where it is well known that a given volume of text is going
to be used again and again, by large numbers of people over a long period of
time. As we know, clergymen have made use of manually prepared concordances
of the Bible since the 12th century".

In machine-compiled indexes, no items or entries are eliminated by the machine,
whereas in even the most rudimentary of machine-generated indexes, such as KWIC,
various reductive or extractive operations are automatically applied as a part of the
machine procedure. We shall be concerned in this section with brief discussions of
machine-compiled indexes and related devices, specifically, concordances, card or book
catalogs mechanically prepared, citation indexes, and special indexes such as Tabledex.
The use of machines to compile, sort, duplicate and list index entries can only be
considered to be mechanized indexing in a relatively trivial sense. We shall consider,
therefore, only a few representative examples, emphasizing early work and some of the
pioneering instances.

2.1 Concordances and Complete Text Processing

When, as early as 1856, Crestadoro proposed the use of permutations of the words in
titles as a subject-content index, the only "machines" available for the processing
operations were people acting in a strictly clerical way. Precisely such clerical operations
have been used for centuries in a process that is, in the special sense of full
representation of document contents, an index-producing operation--the making of concordances. 1/
The task of listing each separate word in a book in all the contexts in which it appears
is incredibly time-consuming and tedious when carried out by manual means. There are
those who have spent the major part of their lifetimes at this task. For example: "It
took James Strong thirty years to compile his exhaustive Concordance of the Bible. . . " 2/
The use of machines capable of processing signals which represent and preserve
information offered a potentially revolutionary change, and with the advent of the electronic
computer even more radical possibilities of very high speed processing were opened up.

As early as 1949, J. W. Mauchly (the co-inventor of ENIAC and UNIVAC) envisioned
the use of computers for documentation and library science activities. He suggested that
the full information contents of the Library of Congress collections could be recorded in
machine language, stored in this form on magnetic tape, and searched by machine in a
procedure which would match words or other selection indicia occurring in the recorded
information to the specified words or selection criteria of a query or search prescription.
Specifically, he estimated that the entire collection, then amounting to 10,000,000 books,
could, when transcribed to binary-code representation, 3/ be serially searched in 20
hours. 4/

1/ See, for example, Black, 1962 [65], p. 314: "The oldest book in the world has had
such an index for many years--the concordance to the Bible;" Markus, 1962 [394],
p. 19: "The ultimate in permutation for indexing is a published concordance;" Linder,
1960 [363], p. 99: "We know of a concordance prepared in the 13th Century;"
Simmons and McConlogue, 1962 [555], p. 3: "Complete indexing has been used of
course for centuries in the preparation of concordances."

2/ Carlson, 1963 [101], p. 211.

3/ That is, markings which have one of two values (thus, binary digits or "bits") can
be used to distinguish between 2^n different other symbols, such as alphabetic
characters, by using n such markings; N symbols thus require log2 N markings, rounded
up to a whole number. A binary code for the 26 letters of the
English alphabet requires a five-bit representation for each letter. If numeric digit
characters are also recorded (26+10), a six-bit code representation is required.
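
The bit counts in this note can be checked with a short sketch in Python. The function name is hypothetical, and the integer method used here is only one convenient way of rounding the base-2 logarithm up to a whole number.

```python
def bits_needed(symbol_count):
    """Smallest number of binary digits that can distinguish `symbol_count` symbols."""
    if symbol_count < 1:
        raise ValueError("symbol_count must be at least 1")
    # (n - 1).bit_length() equals log2(n) rounded up, with no floating-point error.
    return (symbol_count - 1).bit_length()

print(bits_needed(26))       # the 26 letters of the alphabet
print(bits_needed(26 + 10))  # letters plus the ten numeric digits
```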

4/ Mauchly, 1949 [406], p. 295. See also "Report to the Secretary of Commerce on the
application of machines. . . ", 1954 [620], p. 67.

Mauchly's suggestion was, in effect, the idea of a complete index that could be
searched by machine. We should note, however, that although subsequent technological
advances could significantly decrease his original time estimate, the crucial questions
that remain are those of what, assuming one-to-one representation of document text, one
would search for. 1/ Natural language searching by machine, in the sense of full text
inspection, is a "pay-as-you-go" concordance technique. It is, however, a technique
which must be aided and abetted by various forms of synonym reduction, syntactic
normalization, homograph resolution and other special processing operations if it is to be
in any sense an effective tool for selection of clues to be retrieved.

Gardin, in a series of recent lectures on automatic documentation (Gardin, 1963
[207, 208]), refers to the opinions of some investigators that it should be possible to
"jump" the stage of indexing and to search the natural language texts directly. The
problem, he points out, then shifts to the determination of all the various ways in which
the possible answers to a question may have been expressed in these natural language
"complete indexes". Instead of carrying out reductions or condensations of the documents,
as in normal indexing procedures, amplifications of questions are required. "Reductive"
indexing of the source documents can only be eliminated at the expense of "expansive"
indexing of questions. Gardin concludes that the gain from this is very doubtful.
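
The shift from "reductive" indexing of documents to "expansive" indexing of questions can be sketched in Python as follows. This is only an illustration of the general idea under stated assumptions: the synonym table, the two sample texts, and the function names are hypothetical, and nothing here reproduces a procedure described by Gardin.

```python
import re

# Hypothetical synonym table; the entries are illustrative only.
SYNONYMS = {
    "computer": {"computer", "machine", "calculator"},
    "index": {"index", "catalog", "concordance"},
}

def expand_query(terms, synonyms=SYNONYMS):
    """Replace each query term by the set of words accepted as equivalent to it."""
    return [synonyms.get(term, {term}) for term in terms]

def matches(text, expanded_query):
    """True if the full text contains at least one word from every expanded term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(words & alternatives for alternatives in expanded_query)

documents = {
    "doc1": "A concordance prepared by machine",
    "doc2": "Manual preparation of a catalog",
}
query = expand_query(["computer", "index"])
print([name for name, text in documents.items() if matches(text, query)])
```

Without the expansion step neither sample text contains the literal query words, which is the difficulty the paragraph above points to: the burden of anticipating alternative wordings moves from the indexer to whoever frames the question.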

There is also the presently staggering burden of time and cost to convert full texts to
machine-usable form. As of February, 1961, it was estimated that the natural language
text material available for machine processing amounted to little more than the words
contained in the Harvard Classics five-foot shelf (Stevens, 1962 [567]). Perhaps up to
ten times that amount is now available, notably in the 6,000,000 words of the statutes of
Pennsylvania 3/ and in several million additional words that have since been keypunched
at the Center for Automation of Literature Analysis, Gallarate, Italy. 4/ A very recently



1/ See, for example, Yngve, 1959 [657], pp. 978-979: "We will have to find formal
connections between widely divergent ways of saying essentially the same thing. In
addition there is much that we will have to learn about searching. If we had today a
complete grammar of English which was capable of rendering explicit all relations
and distinctions implicit in the document, I doubt that we would know how to use it
effectively in a machine search situation. We would be embarrassed by the very
wealth of the information available. Much more must be learned about search
situations."

2/ See also Bar-Hillel, 1962 [35], p. 415: "Could not the stage of clue assignment be
completely skipped and the request topic be directly compared with the original
documents? It is very natural that such a thought should have arisen, but it must
be stressed that there is nothing in our knowledge of the workings of communication
which would indicate that such a proposal is, or ever will be, practical."

3/ See various references by J. F. Horty, W. B. Eldridge and S. F. Dennis, E. M. Fels,
R. Wilson.

4/ R. Busa, data reported at the NATO Advanced Study Institute on Automatic
Document Analysis, Venice, July 1963.



completed study made by the TRW Computer Division, Thompson Ramo Wooldridge,
involves the investigation of the possibilities for a center to provide text in
machine-usable form. The report gives a total figure of approximately 50,000,000 words of text
so available as of February 28, 1964, but this includes non-scientific text, such as
newspaper and popular magazine materials (Mersel and Smith, 1964 [415]).

Mersel and Smith also report on the estimated requirements for machine-usable text
for various research groups, averaging over a million words per year per group. Yet, at
present keypunching costs of one cent or more per word, is it reasonable to assume that
any of these research groups can provide a budget of over $100,000 per year for this
purpose alone? Moreover, this budget would provide for the conversion of no more than
a thousand 1,000-word items or a hundred 10,000-word items at costs, respectively, of
$100 or $1,000 per item. For the present, therefore, the conclusion is inescapable: either
indexing or search based upon full text processing is not yet practical. Even the most
enthusiastic proponents of "searching full natural language text" (Swanson, 1960 [589])
and "maximum-depth indexing" (Simmons and McConlogue, 1962 [555]) generally agree as
to the present impracticality of full-text mechanized indexing except for special limited
cases.
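
The budget figures quoted above can be laid out as a small worked calculation in Python. The function and parameter names are hypothetical. The per-item costs and the $100,000 budget are taken from the paragraph; the per-word rate is not stated there but is derived from them, and it comes to about ten cents per word, which falls within the stated "one cent or more".

```python
def conversion_plan(budget_dollars, words_per_item, cost_per_item_dollars):
    """Return (items affordable, total words converted, implied cents per word)."""
    items = budget_dollars // cost_per_item_dollars
    total_words = items * words_per_item
    cents_per_word = cost_per_item_dollars * 100 / words_per_item
    return items, total_words, cents_per_word

# Figures from the text: a $100,000 budget, items of 1,000 or 10,000 words.
for words, cost in [(1_000, 100), (10_000, 1_000)]:
    print(conversion_plan(100_000, words, cost))
```

Either way the budget covers about a million words, which matches the average annual requirement reported per research group.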

The two problems of determining what to search for, given full text, and of feasibility
of conversion of text into machine-usable form thus combine to limit "complete indexing"
largely to the special cases of providing corpora for studies in the field of computational
linguistics and of compiling the traditional scholarly tool--the concordance to all the words
in a given literary work or works. Apparent exceptions, including experimental work
with abstracts only and the law statutes studies, are usually cases in which the selective
principle of disregarding common words (and hence the bulk of the actual text) is applied
automatically either on input or in subsequent processing (Cleverdon and Mills, 1963
[131]). These cases, therefore, may be considered machine-generated indexes rather
than machine-compiled. Moreover, it should be noted that:

"The law, itself, is an appropriate field for data retrieval. The statutes,
especially, are written in relatively clear, concise language. At least, this
is their intent. Practically, this means that input and output can both be
relatively short and that retrieval of legal information will be involved with
fewer semantic difficulties."

In the area of concordance-making, however, the potentialities of machine
compilation have been put to good use. The pioneer efforts in this area are unquestionably
those of Father Roberto Busa, S. J., of the Gallarate Center. As early as 1946, Busa
proposed to his superiors that a card file recording all the words used in all of the works
of St. Thomas Aquinas should be set up, and he began his actual experiments using IBM
punched card equipment in 1949 (Busa, 1953 [87], 1960 [91], and 1958 [92]; Secrest,
1958 [540]). 2/ Appearing in 1951, his Sancti Thomae Aquinatis Hymnorum Ritualium
Varia Specimina Concordantiarum is the first known example of a complete word index
that was compiled by machine techniques. The early Gallarate work was carried out on
standard punched card equipment, but from the time of the concordance to the Dead Sea
Scrolls, computers have also been used (Tasman, 1959 [595], [596], and [597]). The
major continuing task is still the preparation of concordances to other works of St. Thomas.
Other machine-compiled concordances produced by Busa's Center include one to Goethe's
Farbenlehre, Bd. 3.

Asher and Kurfeerst, 1963 [24], pp. 1-2.

See also Scheele (ed.), 1961 [522], pp. 206-209.

Other relatively well-known examples of machine-compiled concordances include
those to the Revised Standard Version of the Bible (Ellison, 1957 [186]; Cook, 1957 [139])
and to Matthew Arnold's poetry (Painter, 1960 [461]; Parrish [467, 468]). The Cornell
Concordance Series, under the general editorial supervision of Parrish, includes
investigations of Old English, such as The Anglo-Saxon Poetic Records (Bessinger, 1961
[59]).

The November 1962 issue of Current Research and Development in Scientific
Documentation, No. 11 [430], lists several concordances compiled by machine, including
the work of Sebeok [533, 534] and associates at Indiana University on Cheremis folksongs,
the work on the National Vocabulary of the French language under Quemada at the
University of Besancon, the preparation of glossaries and concordances to the works of
Kant at the University of Bonn, and concordances to medieval German texts being
compiled by Wisbey at the University of Cambridge (Wisbey, 1962 [646], [647]). At the
University of Gothenburg in Sweden, work has begun on mechanical linguistic analysis of
English language texts, using the machine-readable teletypesetter tapes used for the
printing of paperback books (Ellegard, 1960 [184] and 1962 [185]). Another recent
example is that of the work at the Summer School of Linguistics, University of Mexico
(Grimes and Alvarez, 1961 [243]). By 1963, Marthaler writes that "Compiling
concordances with the aid of a computer is already standard routine to such an extent that
it needs hardly be described in detail." As of January 1964, a general-purpose
computer program for the IBM 7090 which can compile various types of concordances has
been announced as available from the Mechanolinguistics Project at the University of
California (1964 [95]).

The major advantage of using machines to compile concordances is, of course, the
enormous difference in the time required to complete the work. Thus, only 120 hours
were required on the UNIVAC computer to prepare the 800,000 words of the Concordance
to the Revised Standard Version of the Bible (Cook, 1957 [139]; Ellison, 1957 [186]).

See "Actes du colloque sur le mecanisation. . . ", 1961 [1]; Quemada, 1961 [485] and
1959 [486]; Centre d'Etude du Vocabulaire Francaise, "Specimens de Travaux
lexicographiques. . . ", 1960 [106].

"California Concordance Program Available", 1964 [95].

Carlson, 1963 [101], p. 211.



In the use of the IBM 705 for the concordance to the Summa Theologiae, Fr. Busa reports
that only 60 hours were required to arrange in alphabetical order 1,600,000 words. 1/
This advantage of speed, with the concomitant benefits of both economy and timeliness, is
illustrated by Tasman as follows:

". . . It has been estimated that it would take 50 scholars 40 years. . . to manually
index the 13 million or so words of St. Thomas Aquinas' complete works. IBM
punched card machines would produce the indexes and concordances much more
accurately and would take ten scholars about four years. Large-scale data
processing techniques would reduce the time to about 25 percent. . . (or). . . ten
scholars to do the job in less than a year." 2/

Other advantages stem from the facility with which further machine processing can be
introduced. Once the text is in machine-readable form, a number of valuable byproducts
can be derived. Examples are statistics on the number of words that have 2, 3, ... n
letters; frequencies of letter usage; printouts of occurrences of specified words or groups
of words; and lists alphabetized on terminal rather than initial letters. Added advantages
of computer processing are further exemplified in the options available with the California
concordance computer program (1964 [95]), some of which are as follows:

(1) The user may obtain a restricted rather than a full concordance by supplying a
list of words for which no entries are to be made.

(2) The user may obtain a selective concordance by supplying a list of words for 
which, and only for which, entries are to be made. 

(3) Each entry word may be centered with its preceding and succeeding context, 
up to the limits of one full line of 131 characters, or each entry word may be 
listed together with the full sentence or verse in which it occurs. 

(4) Text with interlinear information such as grammatical symbols can be used and 
selective concordances can be compiled on the basis of such interlinear 
information. 

(5) The citations of an entry can be listed in order of textual occurrence, in an 
order determined by preceding or following words in its context or in an order 
determined by accompanying interlinear symbols. 
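
Options (1) through (3) above can be illustrated with a minimal sketch in Python. It is not the California program, whose internals are not described here; the function name, parameters, and sample verse are hypothetical, and the context window is far narrower than the 131-character line mentioned in option (3).

```python
import re

def concordance(text, exclude=None, include=None, width=30):
    """Return (keyword, context line) pairs, sorted by keyword and then by position.

    exclude -- words for which no entries are made (option 1)
    include -- if given, the only words for which entries are made (option 2)
    width   -- characters of context kept on each side of the keyword (option 3)
    """
    exclude = {w.lower() for w in (exclude or ())}
    include = {w.lower() for w in include} if include is not None else None
    entries = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in exclude or (include is not None and word not in include):
            continue
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that every keyword starts in the same column.
        line = f"{left:>{width}}{match.group()}{right:<{width}}"
        entries.append((word, match.start(), line))
    entries.sort(key=lambda entry: (entry[0], entry[1]))
    return [(word, line) for word, _, line in entries]

verse = "In the beginning God created the heaven and the earth"
for word, line in concordance(verse, exclude=["in", "the", "and"], width=20):
    print(f"{word:<10}|{line}|")
```

Citations are listed here in order of textual occurrence within each keyword, which is the first of the orderings named in option (5); the other orderings would need a different sort key.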

2.2 Card Catalogs, Book Catalogs, Bibliographies and Subject Index Listings 
Prepared by Machine 

The use of machines such as punched card equipment for the preparation and
processing of library card catalogs and of index listings was advocated by a few far-sighted
documentalists at least as early as the 1930's (Parker, 1938 [463]; Dewey, 1959 [153]).

1/ See his statement in Scheele, 1961 [522], p. 209.

2/ Tasman, 1958 [596], p. 11.



McCormick's bibliography on mechanized library processes (1963 [407]) lists a number
of early suggestions, notably those of Fair in 1936 [187], Shera in 1938 [547], and Gates
[225] and Callander [96, 97, 98] in 1946. Cox, Bailey and Casey proposed the use of
punched card equipment for the preparation of bibliographies in the field of chemistry in
1945 [142].

By 1946, Gull claimed that:

". . . Punched cards and present equipment offer new possibilities right now for
solving the problems of the indexes to Chemical Abstracts. These indexes are
large undertakings in themselves, and the work of arranging, cumulating, and
printing them can be simplified by placing the index information on punched
cards at the time the abstracts are made. With current indexes on punched
cards, two or three cumulations of the author index during the year will greatly
reduce the work required in using current issues from that approach.
Cumulations of the subject, patent, and formula indexes immediately become possible
for intervals more frequent than once a year." [245]

The following year (1947) saw a summary by Gull of potential applications of punched
cards in special libraries [247], and Becker surveyed some of the then discernible
prospects for library mechanization, as a student in the Library School of Catholic
University. He stressed such advantages as flexibility in the processing of new material
for abstracting, indexing, filing, and interfiling purposes and the printing out of various
listings in any format. 1/

The potential use of machines for library science and documentation had not actually
been recognized, however, for many years after the invention of punched card equipment.
Both the punched card developments (beginning with Hollerith and Powers in the 1880's)
and the electronic computers developed from 1946 onward were first applied to the
automatic manipulation of information in the sense of statistical, mathematical, or
engineering data, rather than to information about data or information about other information.

Dr. John Shaw Billings, himself a librarian of note, was apparently the first to suggest
to Herman Hollerith the idea of recording information as holes punched in cards which
could then be sorted mechanically. Larkey comments: "It is not known if Billings ever
thought of applying the principle to bibliographic work, but it would seem eminently
fitting that it might be so utilized." 3/

Larkey himself, as head of the Army Medical Library Research Project at the Welch
Medical Library, Johns Hopkins University, was certainly one of the pioneers in such
utilization, but this was almost 70 years from the date of the Billings-Hollerith
conversations. The Army Project, begun in late 1948 or early 1949, had as its contract

1/ Becker, 1947 [43], pp. 11-12: "From the flexible arrangement of the cards,
bibliographies become readily available by subject, author, and title. In special
libraries, where material on one subject is concentrated, the research possibilities
of gathering, sorting, filing, and printing information are almost limitless.
Continuous machine interfiling permits keeping current with new entry additions."

2/ "With the masters. . . ", 1963 [648], p. 18.

3/ Larkey, 1953 [351], p. 34.



objective "to explore existing and projected methods, emphasizing machine methods,
applicable to such pilot projects as may be necessary" (Larkey, 1949 [348], 1956 [349],
and 1953 [351]). Also as of 1949, the Library of the Department of Agriculture is
reported to have "conducted an experiment in the use of electronic data-processing
machines to produce the author and subject indexes to the 'Bibliography of
Agriculture'." 2/



It was not until the early 1950's, however, that punched card machine techniques were
actively put to use for the preparation of card catalogs, book catalogs, bibliographies and
various index listings. Then, a number of independent but largely concurrent applications
were tried out on at least an experimental basis, including, in addition to the work of the
Welch Medical Library Project, pioneering efforts in mechanized book catalog production
(Griffin, 1960 [242]; Martin, 1953 [400]; Berry, 1958 [58]) and what is claimed to be the
"first successful non-experimental punched-card catalog of periodicals", the Serial Titles
Newly Received (now New Serial Titles), as published by the Library of Congress from
1951 onwards.



The work at the Welch Medical Library continued for several years, the final report
being issued in 1955 [234]. Beginning in 1951, the project maintained in punched card
form the subject heading authority list used for the Current List of Medical Literature
(Larkey, 1953 [351]; Garfield, 1953 [217] and 1954 [220]). Garfield has stated that this
work "clearly demonstrated the ease of converting alphabetic subject heading lists to
categorized or classified lists of terms by the use of punched card equipment." 2/ That is,
each heading or subheading had assigned to it a numeric code reflecting its appropriate
position in the classified system, which could then be used by machine for sorting,
ordering and listing. Ingenious use was made of the IBM 101 Statistical Machine in the
preparation of printed subject indexes (Garfield, 1953 [218] and 1954 [216]). Other
subject heading lists maintained by punched card techniques by 1953 or earlier included
those of the U. S. Patent Office and the Technical Information Division of the Library of
Congress.
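
The conversion of an alphabetic heading list to a classified list can be sketched in a few
lines of present-day code. This is an illustration only: the headings and code values are
invented and are not taken from the Welch project's authority list, and the sketch assumes
simply that each heading carries a dotted numeric code whose parts give its position in
the classified system.

```python
# Hypothetical authority list: (numeric classification code, heading).
# Codes and headings are illustrative assumptions, not the Welch list.
authority_list = [
    ("2.1.0", "Antibiotics"),
    ("1.2.0", "Heart"),
    ("2.1.3", "Penicillin"),
    ("1.0.0", "Anatomy"),
    ("2.0.0", "Drugs"),
]


def code_key(code):
    """Turn '2.1.3' into (2, 1, 3) so that codes compare numerically."""
    return tuple(int(part) for part in code.split("."))


def alphabetic_listing(entries):
    """Order headings alphabetically, as in the original heading list."""
    return [heading for _, heading in sorted(entries, key=lambda e: e[1].lower())]


def classified_listing(entries):
    """Order headings by numeric code, which yields the classified arrangement."""
    return [heading for _, heading in sorted(entries, key=lambda e: code_key(e[0]))]
```

The same records yield either arrangement; only the sort key changes, which is the point
of attaching a code to every heading.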



The first loose-leaf printed book catalog to be produced by machine methods was 
apparently that of the King County Public Library in the State of Washington in 1951, and 
the following year the Los Angeles County Library inaugurated a similar system for the 
distribution of a master book catalog prepared by mechanized techniques (Berry, 1958 
[58]; Griffin, 1960 [242]; Martin, 1953 [400]; Alvord, 1952 [4]).



The work on mechanized preparation of lists of periodicals at the Library of 
Congress has been reported as follows:

"In 1951, the Library began publishing, at monthly intervals, Serial Titles
Newly Received. In 1953, its title was changed to New Serial Titles. . .

Ever since its inception, the fundamental ingredient of the publication has 
been the IBM punched card. . . 



1/ U.S. Congress, Senate Committee on Government Operations, 1960 [619], p. 147.
"Two important advantages of the punched-card method were foreseen when the
publication began. First, it would be possible to print lists from the cards at will,
without any further editing or proofreading, once the information was in punched-card
form. Second, there was the possibility of mechanically preparing special lists of
titles, selected on the basis of subject, country, or language." 1/

Thus, by 1953, "a number of instances of printed indexes prepared by machine" could 
be claimed. 2/ The use of punched cards to sort, to prepare tabular listings for various
drafts and revisions, and to interfile corrected or revised entries greatly facilitated the
preparation at Battelle Memorial Institute of the subject index to the Proceedings of the
International Conference on the Peaceful Uses of Atomic Energy, 1955 (Lipetz, 1960
[367]).



Developments in the use of punched card machine techniques in bibliographic opera-
tions of these types, beginning in the 1950's, have by no means been limited to the United
States. For example, Remington Rand punched cards have been used in the preparation of
a national union catalog of Italian libraries, 1/ and Mikhailov reports for the All-Union
Institute of Scientific and Technical Information (VINITI) as follows:

"The development program for machine production of indexes has been underway 
at the Institute for a number of years. . . In fact, operational use of Soviet-made 
punch-card machines to compile the author indexes for some of the series of our
Abstract Journal has been practiced at the Institute since 1957. " ^ / 


In France, at the Centre d'Etudes Nucleaires, Saclay, a program has been developed 
for mechanization of the production of biweekly and cumulative indexes and for demand 
searches (Chonez, 1960 [116, 117, 118]).



With the advent of automatic data processing systems, the speed, the flexibility and 
the capability for multiple-purpose processing buttress the claim that the card catalog can
be "replaced or supplemented by book catalogs made with the aid of mechanized equip- 
ment". — It is further claimed that "The printed catalog produced by means of automatic 
equipment combines the best features of the conventional card catalog and the traditional 
printed catalog, and adds to both new dimensions that would have been unbelievable a 
generation ago. " fd A joint project is under way by the Medical Libraries of Columbia, 
Vertanes, 1961 [625], p. 242. This is with reference to the LILCO Library Printed
Catalog, which is prepared by sorting and processing information on titles, authors
Lighting Company.
Harvard, and Yale Universities for computer preparation of book catalogs for books
published from 1960 onward (Kilgour et al., 1963 [324]). Another recent illustrative
example of the production of printed book catalogs by means of computer compilation is
that of the Boeing "SLIP" System (Weinstein and Spry, 1963 [633]).

Along with recognition of computer-processing potentialities there has emerged
increased awareness of the desirability of taking advantage of one-time recording of 
information to serve multiple purposes: the principle of by-product data generation. The 
advantages for the library and document collection are that a single recording of biblio-
graphic information in machine-usable form can lead to a variety of products, specifically
including printed book catalogs, 1/ recurrent and demand bibliographies, the requisite 
number of copies for conventional card catalogs, card catalog sets or catalog listings for 
the personal use of the individual worker, input to mechanized selection and retrieval 
systems, and machine-manipulatable data for such other purposes as circulation control. 

Turner and Kennedy report, for example, the initial use of a Flexowriter to prepare 
library catalog cards and the by-product generation, via a 1401 computer, of bi-weekly 
listings of unclassified report titles at the Lawrence Radiation Laboratory, the "SAPIR" 
System (Turner and Kennedy, 1961 [615]). Chasen discusses a change from a previous
punched card system for circulation and recall at General Electric's Missile and Space
Division Laboratory to a combined Flexowriter and G. E. 225 computer procedure to
provide mechanized retrieval, compilation of desk catalogs, computer updating of
catalogs and files, and the maintenance of subscription lists (Chasen, 1963 [108]).

Fasana describes a system at the Air Force Cambridge Research Laboratory Library 
where typing indications in the tape are used as boundary codes. He reports: 

"Input tapes are currently being processed on a computer to automatically produce 
catalog card sets, circulation control records, and book form indexes. Original 
input tapes now being accumulated will form the basis of a machine-searchable
file to be used in the future for more sophisticated printouts and searches." 2/

For such applications, Durkin and White make the following typical claims: 

"The system described has permitted the IBM Command Control Center Engineering 
Library to produce its catalog cards and library bulletin both faster and cheaper. 
Since a by-product of this process is the preparation of all catalog information in 
1/ See, for example, Olney, 1963 [458], p. 42: "During the past few years a number
of libraries have initiated a program of mechanization . . . by punching on IBM cards
or paper tape some of the bibliographic information normally given on catalog cards.
Recording this information in machine-readable form makes it very easy to prepare
printed book catalogs . . ."
2/ Fasana, 1963 [195], p. 326. This system involves the "Machine-Interpretable
Natural Format" and procedures developed for AFCRL by Itek Corporation;
see also Lipetz et al., 1962 [368].



punched card form, it has also permitted the establishment of a circulation control 
system, the publication of overdue notices and reading lists, and the eventual 
institution of a computer retrieval program" (Durkin and White, 1961 [173]; White,
1963 [638]).

Heiliger reports for the library of the new Chicago Campus of the University of 
Illinois as follows: 

"The type of bibliography the computer can produce does make greater use of LC 
card information than do present card catalogs. With the computer programmed 
with a set of library filing rules and a set of symbols that describes for the computer 
the various parts of the bibliographic unit, it can print out, for instance, a list of
books published in a given country, between certain years, on a certain subject (or
combination of subjects), that are illustrated and have bibliographies. It will also
be possible to permute on individual items in LC subject headings in the same fashion
that Chemical Titles does on titles. This index has been dubbed POSH (permuted on
subject headings)." 1/
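
A permuted listing of this kind can be sketched as follows. The source does not say how
POSH divides or rotates a heading, so the sketch makes two assumptions: that the
elements of a heading are separated by "--", and that the elements are rotated so that
each in turn becomes the entry point, in the manner of a permuted title index. The
function names and sample headings are hypothetical.

```python
def permute_heading(heading, sep="--"):
    """Return one rotated entry per element of a subdivided subject heading.

    Each entry is a pair (rotated form, original heading), so the reader of
    the index can always get back to the heading as assigned.
    """
    parts = [p.strip() for p in heading.split(sep) if p.strip()]
    entries = []
    for i in range(len(parts)):
        rotated = parts[i:] + parts[:i]  # element i becomes the entry point
        entries.append((" -- ".join(rotated), heading))
    return entries


def posh_index(headings):
    """Collect the rotated entries for all headings and file them alphabetically."""
    entries = []
    for heading in headings:
        entries.extend(permute_heading(heading))
    return sorted(entries, key=lambda e: e[0].lower())
```

Filing the rotated forms alphabetically gives every element of every heading its own
place in the list, which is what makes the subdivisions searchable.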

Some recent experimental work at Inforonics, Inc. puts major emphasis on by- 
product data generation, beginning with the actual preparation of manuscripts for publi- 
cation. Tape typewriter processing of manuscript for journal articles is being studied 
from the point of view of producing machine-usable text. This text, together with coded
identification of the separate items in the text, is so prepared that computer programs
can produce from the single input automatic typesetting tapes for the article itself,
author and subject index entries, and the like. Computer text transformations can also
produce entries for citation indexes, abstract journals and search files (Buckland, 1963
[83, 84]).

Other computer-produced indexes or special indexes involving compilation rather
than selection by machine include indexes to Nuclear Science Abstracts (Day and Lebow,
1960 [151]), the Current List of Medical Literature (Chonez, 1960 [116, 117, 118]),
the Retrieval Guide to Thermophysical Properties Research Literature, 1/ and the
Research and Development Abstracts of the USAEC (Sherrod, 1963 [54l]). At the
Atomic Energy Commission also, a modification of this RDA computer program is used
for author, corporate author, number and subject indexes for the Engineering Materials
List, which includes announcements of blueprints and drawings. In several instances,
machine processing capabilities are used for permuted listings under various assigned
indexing terms. Special cases of machine permutation operations involve compilation
and organization of chain indexes, used to reflect the various key entries in faceted
classification systems (Dowell and Marshall, 1962 [159]; Foskett, 1962 [199]; Olney,
1963 [458]).
A final special case of a computer-compiled index should be noted. This is the work
of Schultz and Shepherd with reference to the annual meetings of the Federation of American
Societies for Experimental Biology (FASEB) (Schultz and Shepherd, 1960 [532]; Schultz,
1963 [527]; Shepherd, 1963 [545]). 1/ The indexing terms are generated first by the authors
of the papers but are then run against a computer program, which by thesaurus-type look-
up eliminates synonyms and supplies syndetic devices in addition to formatting the subject
index for printout.

The machine-readable thesaurus developed for this project presently performs the
following four basic functions (Schultz, 1963 [527]):

1. It accepts words from titles and indicia supplied by the authors without
modification if they match acceptable indexing terms.

2. It recognizes certain other words as acceptable if modified and modifies them
accordingly, for example, by "use" directions for synonyms and near-synonyms.

3. It adds additional indexing terms when certain words occur, an example being
"'penicillin', use also 'antibiotics'."

4. It deletes certain words if they do not occur in the context of an acceptable
indexing phrase.
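
The four functions listed above can be sketched as a small table look-up. This is not the
FASEB program, whose internals are not described here; the vocabulary, the "use" and
"use also" tables, and the function name are all invented for illustration, and the sketch
assumes that an acceptable phrase is at most two words long and that any word the tables
do not recognize is simply dropped.

```python
# All three tables are hypothetical examples, not the FASEB thesaurus.
ACCEPTED = {"antibiotics", "penicillin", "kidney", "blood pressure"}
USE = {"renal": "kidney", "antibiotic": "antibiotics"}   # function 2
USE_ALSO = {"penicillin": ["antibiotics"]}               # function 3


def index_terms(words):
    """Apply the four thesaurus functions to a list of lower-case words."""
    terms = []
    i = 0
    while i < len(words):
        # Look for an accepted two-word phrase before falling back to one word.
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in ACCEPTED:
            candidate, step = pair, 2
        else:
            candidate, step = words[i], 1
        i += step
        # Function 2: replace synonyms by the preferred term.
        candidate = USE.get(candidate, candidate)
        # Function 4 (and, by assumption, any unknown word): delete words
        # that are not acceptable on their own, e.g. a bare "pressure".
        if candidate not in ACCEPTED:
            continue
        # Function 1: accept the term; function 3: add its "use also" terms.
        for term in [candidate] + USE_ALSO.get(candidate, []):
            if term not in terms:
                terms.append(term)
    return terms
```

Under these assumed tables, "pressure" survives only inside the phrase "blood pressure",
and "penicillin" always brings "antibiotics" with it.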

2.3 Tabledex and Other Special Purpose Indexes

The uses of machine techniques in index compilation so far discussed represent 
instances in which conventional tools of bibliographic control can be prepared at lower 
cost or more rapidly, or both. In addition, however, certain new and unconventional 
types of index have been or are being produced with the aid of computers. 

The Tabledex method, as proposed by Ledley in 1958 (Ledley, 1958 [352]; Zusman
et al., 1962 [661]; O'Connor, 1960 [442]), 2/ involves coordinate indexing in bound book
form, with special features to facilitate search, conserve space and display index terms
co-occurring with a given term for a given item. A major advantage claimed for this
method is that by the use of computers bibliographies and book-form indexes can be
organized, compiled, and printed in page format within a matter of hours.

A Tabledex index typically consists of a bibliography proper, in which each citation 
has been assigned an identifying number; an alphabetical list of the indexing terms used,
1/ These investigators claim the first production of a conventional subject index by
computer.
2/ See, for example, O'Connor, 1960 [446], p. 241: "Ledley approximately halves the
average size of the document descriptions required by imposing an order on the
vocabulary of indexing terms. When a document description belongs in a term subset,
only those terms of the description need to be recorded which come later in term
order than the term of the subset. This illustrates another type of
storage organization."
which may also have numeric codes; and a set of indexing tables. These tables contain
item numbers in the leftmost column, and either the names or the codes for indexing 
terms assigned to an item along the row. There is one such table for each distinct term 
used in indexing the items. 

To facilitate searching, only those terms which are of higher numeric or alphabetic 
order than that for the term for which the particular table is compiled are recorded in the 
rows. Thus to make a search on several terms, the user turns to the table for the one of 
these terms that has the lowest term value, which table records all items to which the 
term has been assigned, and checks the rows of the table for the second lowest ranking 
term, the third, and so on. Variations in the Tabledex method allow for the automatic
assignment of numeric codes to the indexing terms based on relative frequency of use
within the collection. Ledley also discusses methods for finding articles associated with
all except one, all except two, or all except n of the given words in a search
prescription. 1/
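
The table construction and the search procedure just described can be sketched as
follows. This is an illustration under stated assumptions, not Ledley's program: terms
are ordered alphabetically rather than by frequency-based numeric code, the function
names and sample items are invented, and the "all except n" searches are not attempted.

```python
def build_tables(items):
    """Build one table per term from a mapping of item number -> set of terms.

    Each row of a term's table holds the item number and only those of the
    item's terms that come later in term order than the table's own term.
    """
    tables = {}
    for item in sorted(items):
        terms = sorted(items[item])
        for pos, term in enumerate(terms):
            tables.setdefault(term, []).append((item, terms[pos + 1:]))
    return tables


def search(tables, query_terms):
    """Return the items to which every query term has been assigned."""
    wanted = sorted(set(query_terms))
    if not wanted:
        return []
    # Turn to the table for the lowest-ranking query term, then check each
    # row for the remaining (higher-ranking) terms.
    lowest, rest = wanted[0], wanted[1:]
    return [item for item, higher in tables.get(lowest, [])
            if all(term in higher for term in rest)]
```

Because every other query term ranks higher than the one whose table is consulted, a
single table suffices for any conjunctive search, and each item's terms are stored on
average about half as often as in a full term-by-term listing.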



A first example of a computer-compiled Tabledex index was that to a bibliography
prepared by the Library of Congress for the International Geophysical Year (Zusman
et al., 1962 [661]). 2/ The computer program for the IBM 7090 carried out the operations
of assigning accession numbers, extracting index terms and compiling the term lists,
determining frequencies so as to assign frequency numbers to the terms, organizing and
preparing the tables, and developing an author index. Two formats were used, one giving
terms by numeric code and the other spelling out the terms as normal words. The latter
feature provides a measure of browsability in the system. 3/ A Tabledex compilation
program is also in use at the Applied Physics Laboratory of Johns Hopkins University
(Olmer and Rich, 1963 [454]).



Another coordinate index search tool, making use of what is in effect a document-
descriptor matrix with special codes and column arrangements to save space and
facilitate rapid scanning, is the Scan-Column Index suggested in 1960 by O'Connor [449].
He further suggested the use of computers for compilation, as follows:

"A computer can organize information about documents into a scan-column index.
The input needed consists of the document identifications and their accompanying 
Zusman et al., 1962 [661], p. ii: ". . . The word tables have the advantage that
the other words also associated with the article of that row".
index terms . . . and an indication of either the number of columns desired or the
column density desired. The computer will determine the frequency of each
term, the positive and negative correlations of terms, and the quantity of these
correlations by counting or sampling key figures, such as the average number
of terms per document. It then can assign column-character codes accordingly."




In 1961, Costello described the use of computer techniques for compilation and
computer printout of a dual dictionary for a coordinate indexing system using links and
roles at DuPont's Polychemicals Department. After manual analysis, term-role assign-
ments are keypunched, the cards are listed for editing including the elimination of
synonyms and the indication of appropriate postings to more generic terms, and re-
keypunched for conversion to magnetic tape. Tapes for posting of items and links to
term-roles are merged by computer with tapes giving alphabetical equivalents of term
codes and with appropriate syndetic indications for final output on an IBM 407 high-speed
printer [141].



Still another instance of a coordinate index, modified to show pre-coordination of
terms as compiled by computer, is that of the Electronic Properties Information Center
(Johnson, 1963 [301]). The system consists of abstract cards maintained in accession
number order, together with machine printouts that pre-coordinate descriptors within
nine major categories. The listings of pre-coordinated descriptors are arranged in
three different indexes: alphabetically arranged within each category, alphabetized with-
out respect to category but with code indication of the category reference, and a non-
categorized listing arranged alphabetically in reverse order. Advantages of machine
processing include the ease with which various statistical counts can be made, such as
the average number of items in the system for a given material and a specified property.
Summary indications of the state-of-the-art in the field of interest can be obtained, "for
the system will indicate not only areas where research has been done, but also areas
where gaps in the literature occur, and a measure of the growth of research activities
in the field can be developed."



2.4 Citation Indexes

"A citation index is a directory of cited references in which each reference is
accompanied by a list of source documents which cite it." This is a relatively new
type of bibliographic search tool that would be almost impossible to compile without the
use of machines. 1/ In at least one case, moreover, the availability of mechanical
devices was itself the inspiration for the idea of a citation index to the scientific litera-
ture. Garfield states in a 1954 paper that he was led to the idea of "Shepardizing" from
an earlier concern with the development of citation codes or "coden" that would
facilitate machine processing of bibliographic and index entries. 2/

The value of Shepard's Citations in tracking down precedents and decisions has been
recognized in the legal field for many years. 4/ The desirability of a similar tool for
literature searchers in the fields of scientific and technical information was suggested
about a decade and a half ago, when Seidell and others proposed its use for patent
searching (Seidell, 1949 [541]; Hart, 1949 [255]). In 1954, the Bush Committee in its
considerations of the potential applicability of machines to Patent Office problems
received a proposal from the Atlantic Research Corporation of Alexandria, Virginia,
which was to cover "the development of a Patent Citation Index, comparable to Shepard's
Citations". 5/ In the period 1954-1956, both Garfield 6/ and Fano 7/ independently advocated
the development of a citation indexing tool for scientific and technical literature. As
1/ See, for example, Atherton, 1962 [25], p. 4: "The volume of data to be processed
is so massive that processing machines are a necessity"; Garfield, 1954 [210], p. 4:
"Where such large volume of data is to be handled it must be expected that
mechanical devices of high speed and versatility . . . would probably be a determining
factor in the system's success."

Garfield, 1954 [210], p. 2.

4/ How to Use Shepard's Citations [281] has been published periodically by Shepard's
Citations, Inc., Colorado Springs, since 1873.

5/ U.S. Dept. of Commerce, "Report to the Secretary of Commerce . . .," 1954 [620],
p. 27.

6/ Garfield [210, 211, 212]. Adair, writing in January, 1955, specifically acknow-
ledges a suggestion of Garfield's (1955 [2], p. 32) but Garfield in turn credits
Adair (1963 [214], p. 290).

7/ Fano, 1956 [191], p. 3: "Let us accept, at least for the sake of this argument, the
conclusion that linguistic associations between documents cannot lead to a satis-
factory definition of a bibliography. Then the only other type of association for
which evidence is available is that provided by simultaneous references in the
literature, by the concomitant use of documents by experts as evidenced by library
records, and by other similar joint events."
of today, there are at least five or six instances of citation indexes that have been pro-
duced, several different experimental investigations are under way, and new interest
has been generated by the considerations of the Weinberg Panel. Thus:

"Of the newer approaches to the indexing of scientific documents, the Weinberg 
Panel was particularly impressed with the citation index as a promising biblio- 
graphy tool. In order to learn more about this approach, the National Science 
Foundation is currently sponsoring the compilation and publication of extensive 
citation indexes for the fields of genetics and also for statistics and probability; 
and is supporting two kinds of experiments to evaluate different techniques for 
using citation data in indexes and searching systems in the field of physics." 1/

In general, the principle of citation indexing is based upon the hypothesis that the
bibliographic references cited by an author provide significant clues to the subject content
of the author's own paper and/or that there is a certain commonality in subject between
papers that cite the same references or that are co-cited. 2/ The principle can be applied
to the compilation of bibliographical or indexing tools in several different ways. First,
there is the method of citedness, which groups for a given item the identifications of sub-
sequent items that have cited it. The converse of this is, of course, the bibliography or
reference list of a given item. 3/ In the first case, we are concerned with "descendants,"
and in the list of references with "ancestors". 4/
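
The relation between the reference list and the citedness listing is a simple inversion,
which can be sketched as follows. The function name and the document identifiers are
hypothetical, and the sketch assumes that the reference lists have already been reduced
to consistent identifiers, which in practice is the laborious part.

```python
from collections import defaultdict


def citation_index(reference_lists):
    """Invert reference lists (citing -> cited) into a citation index (cited -> citing).

    reference_lists maps each source document to the documents it cites,
    i.e. its "ancestors"; the result maps each cited document to the later
    documents that cite it, i.e. its "descendants".
    """
    index = defaultdict(list)
    for citing in sorted(reference_lists):
        for cited in reference_lists[citing]:
            if citing not in index[cited]:  # ignore repeated citations
                index[cited].append(citing)
    return dict(index)
```

For example, given reference lists in which C cites A and B, B cites A, and D cites B, the
index lists B and C under A, and C and D under B.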



1/ Committee on Scientific Information, 1963 [135], p. 16.

2/ Compare Adair, 1955 [2], p. 32, with respect to Shepard's Citations itself:
"Since all of the cases listed under a given case have cited it, it follows that
they must all be, more or less, pertinent to the case cited." See also Kessler,
1963 [320], p. 1: "This method . . . originated in the hypothesis that the biblio-
graphy of technical papers is one way by which the author can indicate the
intellectual environment within which he operates, and if two papers show similar
bibliographies there is an implied relation between them."

3/ See Salton, 1962 [520], p. III-3: "A citation index consists of a set of biblio-
graphic references (the set of 'cited' documents), each being followed by a
list of all those documents (the 'citing' documents) which include the given
cited document as a reference. A citation index is to be distinguished from a
reference index which lists all cited documents under each citing document."

4/ See, for example, Tukey, 1962 [611], p. 5: "Any user's greatest need is
likely to be for access to the latest information rather than to the oldest, but
the latest items are children, not ancestors. Genealogy is important, but
progress requires tracing descendants." Iung and Vandeputte, 1960 [291], p. 11,
make a similar distinction between "histoire" (antecedents) and "filiation"
(successors).
A second method, implied in Fano's suggestions for the use of relative frequencies
of association between items found in the literature, is one of citingness, which groups
together items that cite one or more identical references. This method has been
developed by Kessler and his associates as the technique of "bibliographic coupling"
(Kessler, [317] through [323]). The purpose here is to identify groupings of related
items where relatedness is defined in terms of the number of references shared by each
of the members of the group with some given test paper or with each other. It is noted
that where the citedness index and the reference list typically give the bibliographic
references themselves as the searching or retrieval tool, the bibliographic coupling
technique seeks rather to define groups of similar papers. 1/ A third method, and one
which may be combined with either of the other two, is to derive indexing terms for a
given paper from the overlay of indexing terms previously assigned to any papers which
it cites. Salton 2/ further suggests that:

"Citation indexes could be used to extend a given set of index terms by
starting with the terms attached to a given document or document set, and
adding to them the 'related' terms obtained from new documents which cite
the original ones."
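
The counting that underlies bibliographic coupling, as described above, can be sketched
as follows. This is an illustration rather than Kessler's procedure in detail: the function
names, the paper identifiers and the threshold parameter are invented, and "coupling
strength" is taken simply as the number of references two papers share.

```python
from itertools import combinations


def coupling_strengths(reference_lists):
    """Return {(paper_a, paper_b): number of shared references} for coupled pairs."""
    strengths = {}
    for a, b in combinations(sorted(reference_lists), 2):
        shared = set(reference_lists[a]) & set(reference_lists[b])
        if shared:
            strengths[(a, b)] = len(shared)
    return strengths


def coupled_to(reference_lists, test_paper, min_strength=1):
    """List the papers sharing at least min_strength references with a test paper."""
    base = set(reference_lists[test_paper])
    result = []
    for other in sorted(reference_lists):
        if other == test_paper:
            continue
        strength = len(base & set(reference_lists[other]))
        if strength >= min_strength:
            result.append((other, strength))
    return result
```

Raising the assumed `min_strength` threshold filters the group down to the papers most
closely tied to the test paper.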

The suggested advantages of citation indexing include the claims that this tool does 
not require trained indexers, that it is highly susceptible to mechanization (Garfield,
1955 [213], 1956 [212], 1957 [211]; Atherton, 1962 [25]; Becker and Hayes, 1963 [45]),
and that it may cost significantly less than subject indexing. A major advantage
claimed is responsiveness to user, rather than indexer, interests and viewpoints. 5/
Some of the representative claims with respect to this factor are as follows:



1/ See Atherton and Yovich, 1962 [26], p. 3: "Kessler's method, however, does not
retrieve the references cited by a paper. Instead these references are examined
to determine the 'bonds' between papers; e. g., if two papers share six references
in common, they are said to have a 'coupling strength' of six. By applying either
of two criteria of coupling, one can 'filter out smaller groups of papers' related
to a given paper."

2/ Salton, 1962 [520], p. III-8; see also Lesk, 1963 [356].

3/ Atherton, 1962 [25], p. 3.

4/ See Atherton and Yovich, 1962 [26], pp. 3-4: "Garfield estimates cost of abstract-
ing and indexing 200,000 articles in one year to be $3 million. He estimates the
cost of a citation index for these same articles (approximately 3 million citations)
to be $300,000." See also Doyle, 1963 [162], p. 8: "The editing labor, the input
preparation cost, and the automatic processing time are all so small that it's very
likely citation indexing is destined for a great surge of popularity in the immediate
future."

5/ Committee on Scientific Information, 1963 [135], pp. 55-56: "Because the index-
ing is based on the author's rather than on an indexer's estimate of what articles
are related to what other articles, citation indexes are particularly responsive to
the user's, rather than to the indexer's viewpoint."
"The most feasible scheme for alerting individuals to what is of interest in their
own field requires an on-going up-to-date citation index. For each narrow field
of interest of an individual there are, it is believed with good reason, three to
five to ten key items such that:

(c1) If he knew that a new item referred to one of his key items,
the individual would be glad to skim the new item,

(c2) An individual who skimmed all new items referring to one
of his key items would be adequately alerted to the newest
results in his own specialties." 1/

"A research worker who finds one article several years old can relate later 
developments by locating all subsequent articles that have referred to it. 
Corrections and errata can be brought together by a citation index." 2/

"Citation indexing will overcome artificial dividing lines that are drawn in various
abstracting services." 3/

"It is believed that citation indexes will be useful . . . in bringing together related
materials in different fields where the interrelationships are not readily
identifiable from other types of indexes." 4/

"Since the end product of a citation indexing is a listing which collects in one
place the bibliographical descendants of a given cited author, bringing these
titles together helps to illuminate for the searcher the extent and nature of
information association patterns employed by other authors who had a similar
or related interest to his own. Its development, therefore, serves as an
approach to the user's frame of reference, not the indexer's." 5/

The importance of being able to pick up more than the principal subject matter 
clues is indeed an advantage of citation indexing. Garfield, commenting on the potential 
cross-breeding of interests, gives an example of a personal search for more information 
on the RCA electronic scanning pencil in which he was led to one of Busa's reports on 
machine use in philological analysis and to an article of interest in the field of informa-
tion theory. 6/ Garfield further points out that the cross-breeding can extend across



1/ Tukey, 1962 [611], p. 9.

2/ Atherton, 1962 [25], p. 2. See also Garfield, 1955 [213], p. 1.

3/ Atherton and Yovich, 1962 [26], p. 3.

4/ Brownson, 1963 [82], p. 3. See also Garfield, 1957 [211], p. 4.

5/ Becker and Hayes, 1963 [45], p. 137.

6/ Garfield, 1954 [210], pp. 4-5.
changes of terminology with time, 1/ and Lipetz suggests that it can break down barriers
with respect to use of foreign literature. 2/



Other claimed advantages relate to the usefulness of the citation index for purposes 
other than those of direct literature search. Such other purposes include identification 
of significant research by "equating frequency of citation with relative significance of 
subject matter" (Salton, 1962 [520]), determinations of the number of references cited
in a given field or by journal or publication date (Atherton, 1962 [25]), evaluation of the
relative importance of various scientific journals (Westbrook, 1960 [636]; Kessler, 1961
[322]), tracing of trends in the history of ideas or in a particular field of literature
(Brownson, 1963 [82]; Salton, 1962 [520]), 3/ and empirical studies of the frequencies of
self-citation, multiple authorship, and the like (Atherton, 1962 [25]).

A number of disadvantages of the citation index are to be noted, however. First is
the obvious lack of consistency between authors in terms of whether or not they cite the
prior literature at all and in terms of the completeness and correctness of the citations
they do make. 4/ Atherton quotes Westbrook as saying:

"Science is subject to changing fashions of interest that lead to a distorted
number of published papers in a given subject and an inordinately high level
of citations to any one who reports first on the fashionable subject. The
method will not appraise work performed but not published." 5/



1/

Ibid, p. 6: "Changes in terminology are to a certain extent overcome through the
citation approach, since the author who makes a reference to a paper that is forty
or fifty years old is making the jump in terminology for us." See also Garfield,
1956 [212], p. 11.

2/

Lipetz, 1963 [366], p. 265: "It is reasoned that availability of a citation index
derived from Soviet physics journals and approachable through familiar American
references should stimulate utilization of the Soviet physics journals in the
United States."

3/

See also Reisner, 1963 [497], p. 71: "Citation indexes are receiving increasing
attention as bibliographic aids and as sociometric tools. As sociometric tools,
they are being used to explore the flow of information across national boundaries
and from pure to applied fields, to determine the structure of a field, and to
determine the 'value' of documents or authors."

4/

See, for example, Doyle, 1963 [162], p. 8: "The disadvantages of this kind of
indexing is, of course, that it depends on authors providing ample and suitable
references"; Salton, 1962 [520], p. III-7: "In many cases personal preferences
are evident both as to number and types of papers cited; authors have varying
backgrounds, and there may also exist a tendency toward self-citation regardless of
relevancy"; Thompson, 1963 [600], p. U-l: "The difficulties ... are largely due to
the extreme variability of format and to the lack of standardization which prevails
in the publication of citations."

5/

Atherton, 1962 [25], p. 4, citing J. H. Westbrook.
An author not cited frequently enough or not cited within a given time period will
not appear in the citation index. Doyle points out that there are "many kinds of documents
we would like to retrieve where it is not customary to provide citations at all". 1/ In the
bibliographic coupling method, both those papers which make no references to any other
paper and those papers which do not share at least one reference with some other paper
in the system are automatically excluded.

Other disadvantages of the citation indexing technique relate to difficulties of the
lack of standard practices in the citing of references and to problems of recognizing
whether one citation is or is not equivalent to another. These are, of course, related to
the normal difficulties arising from non-standardized formats and practices in descriptive
cataloging, in use of journal abbreviations, in transliterations of foreign language titles
and names, and the like, but they are now aggravated by the present prospects for direct
machine processing. As Lipetz points out:

"Author's names may be cited in somewhat different ways, and there is no
simple mechanical procedure for bringing together the different versions.
For example, an author's name may be cited both with and without initials;
it would take a comparison of the additional information on the cited reference
to establish that these authors are the same. Even more difficult are the
problems of mechanically determining that a misspelling has occurred." 3/

Both the disadvantages of incomplete and disproportionate coverage and of failures
to equate equivalent citations are quite readily obvious to the user of a citation index if
he is reasonably familiar with the subject field or document set that is covered. Thus,
the use of the citation index as the exclusive tool for literature search is subject to
defects of both oversight and 'over-cite' which are cumulative and which are often easily
recognizable. Atherton and Yovich emphasize that: "Knowledge of these weaknesses
tends to prevent anyone from trusting the system's ability to retrieve the pertinent
literature." 4/

In general, however, the citation index has not been proposed as an exclusive
means for literature search and retrieval, but rather as one of a set of tools or as a
supplement to other indexes. 5/ In this connection, it is of interest to note that a manual
technique of literature search tested at The Thermophysical Properties Research Center



1/

Doyle, 1963 [162], p. 8.

2/

See Atherton and Yovich, 1962 [26], p. 39; Marthaler, 1963 [399], p. 23.

3/

Lipetz, 1962 [364], p. 262.

4/

Atherton and Yovich, 1962 [26], p. 39.

5/

See, for example, Tukey, 1962 [611], p. 10: "The citation index, in its retrieval
and pursuit uses, is not something to be used alone. Rather, it is the tool whose
presence makes all the other tools more effective."
while not using a citation index as such, makes use of a supplementary citation tracing
technique both to shorten manual search time through abstract journals and to follow up
additional search leads (Lykoudis, et al, 1959 [387]; Cezairliyan, 1962 [107]). The
technique is briefly described as follows:

"One starts searching the abstracting journal beginning with the most recent
issue and going back through a number of years, a. Next, the bibliographies
of the papers located in these a years are searched for new references. The
references found in this second step of the search will, in general, cover a
period of years (b -> a). Then one reverts back to searching through the
abstracting journal again for another period of a years starting with the year b.
This cyclic procedure of alternate searches through the abstracting journal,
followed by searching the bibliographies of uncovered papers, is repeated until
the total number of desired years of search is covered." 1/

In a sample search on the thermophysical properties of metals, the results showed
that the cost of the cyclic procedure was only 65% of the cost of conventional manual
search using the abstract journals only.

Recent efforts in the development and use of citation indexes proper include
experiments in evaluation at the American Institute of Physics, 2/ an extensive compilation and
processing program at the Institute for Scientific Information, 3/ and a cooperative
program between the Statistical Techniques Research Group of Princeton University and the
Bell Telephone Laboratories (Tukey, 1962 [611] and [612]). Reisner has
reported work on the compilation of a citation index to 30,000 patent disclosures and its
experimental evaluation in progress at IBM's Thomas J. Watson Research Center (1963
[497]). Goodman is concerned with a citation index to the literature of new educational
media, especially that on programmed learning and teaching machines (1963 [235]).

At the Centre d'Études Nucléaires de Saclay, a citation index to papers in the field
of thermonuclear fusion and plasma physics is being prepared. 4/ Lipetz is carrying on
work in the preparation and evaluation of citation indexes, begun at the Itek Corporation,
as an independent worker and consultant to the A.I.P. project. 5/ Carroll and Summit
report that citation indexing is under consideration at Lockheed's Missile and Space
Division (1962 [102]). Kessler and associates at M.I.T. 6/ and Salton's group at

1/

Lykoudis et al, 1959 [387], abstract, p. 351.

2/

Atherton and Yovich, 1962 [26]; National Science Foundation's CR&D Report
No. 11, p. 12.

3/

Ibid, pp. 27-28.

4/

Ibid, p. 76.

5/

Ibid, p. 181.

6/

Ibid, p. 128.
the Harvard Computation Laboratory (Salton, 1961 [512], 1962 [513], 1963 [514] and
[515]), are concerned with citations as a basis for grouping and categorizing sets of
related documents.



Early examples of citation indexes that have been produced include the precedents
in the fields of statistics and information theory listed by Tukey. 1/ Tukey also refers to
early experimentation involving manually manipulated card files by J. L. Hodges, Jr.,
Charles H. Kraft, and William H. Kruskal. 2/ Goodman (1963 [235]) describes the use of
Termatrex cards showing for each item other items cited by it.



Examples of machine-compiled citation indexes, however, are those of Garfield and
Sher in the field of genetics (1963 [546]), Lipetz's experimental index to the citations in
the proceedings of the two United Nations conferences on the peaceful uses of atomic
energy (1961 [364], 1960 [365]), and the citation index to references listed in the
"Short Papers" submitted for the 1963 Annual Meeting of the American Documentation
Institute (Luhn, 1963 [37?]). As of January, 1964, the first five volumes of Science
Citation Index are available from the Institute for Scientific Information. These volumes
are reported to have 2,250,000 lines of copy representing the computer-compiled citation
trails for 102,000 articles published in 1961. 3/

Preliminary evaluations of the citation indexing principle have, as noted previously,
been carried out in an American Institute of Physics project supported by the National
Science Foundation. One experiment involved the selection of a single paper from the
December 1, 1961 issue of The Physical Review and the tracing of references and citations
through that journal for the period 1956 to 1961. A bibliography of 64 papers was
produced as a result. This was then evaluated by a nuclear physicist, who found that the
titles alone were an insufficient basis for judging whether or not these papers should all
have been included, and who commented critically that there was no way of knowing if all
the papers really relevant to the subject of the test paper had indeed been found. A
further check by search of the subject index did in fact reveal six pertinent papers which
had been missed by the citation indexing technique.
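
The tracing of references and citations described in this experiment can be sketched in a few lines. What follows is only an illustration under stated assumptions, not the A.I.P. procedure itself: the paper identifiers and reference lists are invented, and the breadth-first walk with a step limit (`max_steps`) is one assumed way of mechanizing the tracing.

```python
from collections import defaultdict, deque

def build_citation_index(references):
    """Invert 'paper -> papers it cites' into 'paper -> papers citing it'."""
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by

def trace(start, references, max_steps=2):
    """Collect papers reachable from `start` by following references
    (backward in time) and citations (forward in time)."""
    cited_by = build_citation_index(references)
    found = {start}
    frontier = deque([(start, 0)])
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_steps:
            continue
        neighbours = set(references.get(paper, ())) | cited_by.get(paper, set())
        for other in neighbours - found:
            found.add(other)
            frontier.append((other, depth + 1))
    return found - {start}

# Hypothetical reference lists, invented for this example only.
references = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P5": ["P3"],
    "P6": [],          # pertinent, perhaps, but linked to nothing
}
bibliography = trace("P1", references)
```

In this toy collection the paper "P6" is never reached, since it neither cites nor is cited by any traced paper. That is the same kind of omission the subject-index check revealed in the experiment.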



A second experiment at the American Institute of Physics involved application of
Kessler's "coupling strength" criteria to 41 of the 64 papers selected in the first
experiment, the remainder being excluded because they shared no references with any
other paper. The resultant groupings of presumably highly related papers were also
evaluated by a subject matter specialist, who found them relevant to each other but the
selection incomplete. Atherton and Yovich, reporting these A.I.P. experiments,
concluded that: "More work will have to be done before the usefulness of citation indexing
can be accurately determined." 4/



1/

Tukey, 1962 [611], pp. 23-24.

2/

Ibid, p. 24.

3/

See news note, Special Libraries, Jan. 1964, p. 58.

4/

Atherton and Yovich, 1962 [26], p. 22.
Kessler himself and his associates have also conducted some experiments in
comparative evaluation of indexing aids derived from citation data on the one hand and
from conventional subject indexing on the other. The basis for evaluation was a total of
334 papers published in The Physical Review in 1958. The study involved detailed
comparison of the ways in which these papers fell into related groups according to the
"analytic subject index" used by the journal's editors and according to the method of
"bibliographic coupling". The essentials of the latter method are described as follows:

"a. A single item of reference used by two papers is called one unit of coupling
between them.

"b. A number of papers constitute a related group G_A, if each member of the
group has at least one coupling unit to a given test paper P_0.

"c. The coupling strength between P_0 and any member of G_A is measured by
the number of coupling units (n) between them." 1/

For the 334 papers, 73 categories of the Analytic Subject Index (ASI) had been used. 
For the bibliographic coupling method, each of the papers was in turn considered as the 
test paper and groups were formed for any of the 333 other papers that shared one or 
more citations with it. In general, it was concluded that there was good correlation 
between the groupings of papers achieved by the two methods. It should be noted, how- 
ever, that 44 papers fell into no groups at all on the basis of the bibliographic coupling 
criterion. 2/ 
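
The three criteria quoted above translate almost directly into set operations. The sketch below is offered only as an illustration: the paper and reference identifiers are invented, and the function names (`related_group`, `uncoupled`) are ours rather than Kessler's.

```python
def related_group(test_paper, references):
    """Return {paper: coupling strength n} for every paper sharing at least
    one reference with `test_paper` (criteria a, b and c)."""
    test_refs = set(references[test_paper])
    group = {}
    for paper, refs in references.items():
        if paper == test_paper:
            continue
        n = len(test_refs & set(refs))   # shared references = coupling units
        if n >= 1:
            group[paper] = n
    return group

def uncoupled(references):
    """Papers that fall into no group at all when each is taken as test paper."""
    return [p for p in references if not related_group(p, references)]

# Hypothetical reference lists, invented for this example only.
references = {
    "P0": ["r1", "r2", "r3"],
    "P1": ["r2", "r3"],
    "P2": ["r3", "r9"],
    "P3": ["r7"],        # shares no reference with any other paper
    "P4": [],            # makes no references at all
}
group_for_p0 = related_group("P0", references)
excluded = uncoupled(references)
```

Here "P3" and "P4" illustrate the two ways in which a paper can fall outside every group, as the 44 papers did in the study described above.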

Salton and associates at the Harvard Computation Laboratory are also concerned 
with the citation indexing principle as a possible basis for grouping similar documents. 
They are also concerned with evaluation of results so obtained by comparison with 
document groups obtained by subject indexing means. In the comparative experiments, 
data were first compiled for a closed document set of 62 items as to similarities with 
respect to both "citedness" and "citingness". The same items were manually indexed 
and similarity coefficients between these items were derived from overlappings of 
assigned index terms. When the two measures of similarity were compared with each 
other and with document associations obtained by random assignments of "citations" and
"terms", the conclusions reached were as follows:

"The similarity coefficients obtained by comparing overlapping citations for a
sample document collection with overlapping, manually generated index terms
are much larger than those obtained by assuming a random assignment of
citations and terms to the documents; relatively large similarity coefficients
are generated for nearly all documents which exhibit at least a minimum
number of citations; little seems to be gained by using citation links of length
greater than two; for early documents, citedness furnishes a better indication
than the amount of citing, and vice versa for recent documents; for documents
which can both cite and be cited, equally good indications seem to be obtained
by comparing citing and cited documents." 3/



1 / 

Kessler, 1963 [320], p. 1, footnote. 

2 / 

Ibid, p. 5. 

3 / 

Salton, 1962 [520], p. III-42.



In the Salton project, tests of the value of citation links for the assignment of index 
terms have been made by comparing the citation pattern of an "unknown" document with 
those of other documents in the collection to derive a set of five "related" documents, 
where relatedness is decided on the basis of the magnitude of the similarity coefficients 
for the citation links. Any index term that appears at least twice in the set of terms 
previously assigned to the five related documents is then assigned to the new item. In 
general, approximately 50% of the terms so assigned were also assigned to the same 
"new" items by human indexing procedures. 

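The term-assignment rule just described (five most similar documents, terms occurring at least twice) can be sketched as follows. The form of the similarity coefficient is not given in the text; the overlap ratio used here (shared citations divided by all distinct citations) is an assumption made for illustration, as are the sample documents and terms.

```python
from collections import Counter

def similarity(citations_a, citations_b):
    """Assumed similarity coefficient: shared citations / all distinct citations."""
    a, b = set(citations_a), set(citations_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_terms(new_citations, collection, k=5, min_count=2):
    """Assign to a new document every index term that appears at least
    `min_count` times among its `k` most similar documents."""
    ranked = sorted(
        collection,
        key=lambda doc: similarity(new_citations, collection[doc]["citations"]),
        reverse=True,
    )
    related = ranked[:k]
    counts = Counter(
        term for doc in related for term in set(collection[doc]["terms"])
    )
    return sorted(term for term, n in counts.items() if n >= min_count)

# Hypothetical collection, invented for this example only.
collection = {
    "D1": {"citations": ["c1", "c2", "c3"], "terms": ["indexing", "citations"]},
    "D2": {"citations": ["c1", "c2"], "terms": ["indexing", "retrieval"]},
    "D3": {"citations": ["c2", "c4"], "terms": ["citations", "statistics"]},
    "D4": {"citations": ["c5"], "terms": ["thesauri"]},
    "D5": {"citations": ["c3", "c6"], "terms": ["retrieval"]},
    "D6": {"citations": ["c7"], "terms": ["translation", "thesauri"]},
}
assigned = assign_terms(["c1", "c2", "c3"], collection)
```

Whether the terms so assigned agree with those a human indexer would choose is the separate question on which the 50% figure above bears.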
As we have previously noted, however, the advantages of citation indexing are likely
to be most effectively applied when used as part of an array of other tools. Tukey
suggests, in particular, that permutation indexes of titles, as in KWIC systems, would be
of great value as "starter" and "re-check" mechanisms for the use of citation indexes. 2/
Brownson reports:

"Consideration is now being given to the possibility of experimenting with a
'hybrid' type of index that would combine permuted titles, authors, and citation
data. Such an index might be more useful than any of the individual types of
indexes issued singly; and, since no human indexing judgment would be involved,
it could be prepared largely by machine and issued rapidly." 3/

Williams, while at Itek, proposed a hybrid integrated index combining listings by
authors, corporate authors or author affiliations, keywords-in-context from title, and
references to works cited by and to works citing an item, and she also developed a sample
format for selected items from several journals in the field of philosophy. 4/

Precisely such a hybrid tool was provided with the Short Papers for the A.D.I.
Annual Meeting 1963, and it was indeed issued rapidly. A brief period of only two or
three weeks elapsed between receipt of many of the manuscripts and the distribution of
two automatically typeset volumes. The second of these volumes contains a KWIC and
an author index to these papers themselves, a bibliography and citation index to all
papers referenced by them, and KWIC and author indexes to the cited papers, all
computer-compiled within this time period. 5/



1/

Ibid. See also Lesk, 1963 [357], p. V-8.

2/

Tukey, 1962 [611], p. 12.

3/

Brownson, 1963 [82], p. 4.

4/

T. M. Williams, private communication, dated January 4, 1962.

5/

Luhn, 1963 [376] and [377], pp. 353-382.



2.5 Machine Conversion From One Index Set to Another



A final possibility in the general area of machine compilation of indexes and machine
use to improve the availability of indexes is as yet in a highly speculative stage. This is
the possibility of converting from one index set to another by machine look-up procedures.
In the Welch Medical Library project, mentioned earlier, use was made of punched card
techniques to convert from one index arrangement to another, 1/ but
machine-recognizable identifiers for both arrangements were explicitly encoded in the material.

In recent studies at Datatrol, however, preliminary investigations have been conducted
looking toward machine look-up of index-term equivalence tables in order to convert, for
example, DDC descriptors to corresponding subject headings used in the AEC vocabulary.

Hammond and Rosenborg (1962 [250] and [252]) report on the compilation of a
unilateral table of "indexing equivalents" between approximately 7,000 DDC descriptors and
those AEC subject headings judged by them to be identical, synonymous, or "usefully"
equivalent, such as one or the other being subsumed by a broader or more generic term.
Findings showed 23.8% of the terms of the DDC vocabulary presumably identical to those
of AEC, 38.1% of lower generic level, 7.4% of higher generic level, and 10.9% for which
no useful equivalents could be found. A sample table of indexing equivalents was prepared
for DDC-to-AEC conversion, but not in the opposite direction.
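
A one-directional table of this kind amounts to a look-up from source descriptor to target heading, with a note on the kind of equivalence. The sketch below is only an illustration of that idea: the table entries, the relation labels, and the `convert` function are hypothetical and are not taken from the Datatrol tables.

```python
from typing import NamedTuple, Optional

class Equivalent(NamedTuple):
    target: Optional[str]   # heading in the target vocabulary, if any
    relation: str           # "identical", "synonym", "generic", or "none"

# Hypothetical source-to-target entries, invented for this example only.
TABLE = {
    "NEUTRONS": Equivalent("NEUTRONS", "identical"),
    "ATOMIC PILES": Equivalent("REACTORS", "synonym"),
    "FAST NEUTRONS": Equivalent("NEUTRONS", "generic"),
    "PROCUREMENT": Equivalent(None, "none"),
}

def convert(descriptors, table=TABLE):
    """Look up each source descriptor; return (converted headings,
    descriptors with no useful equivalent or absent from the table)."""
    converted, unconverted = [], []
    for descriptor in descriptors:
        entry = table.get(descriptor.upper())
        if entry is None or entry.target is None:
            unconverted.append(descriptor)
        elif entry.target not in converted:
            converted.append(entry.target)
    return converted, unconverted

headings, leftover = convert(
    ["Neutrons", "Fast neutrons", "Atomic piles", "Procurement", "Gyroscopes"]
)
```

Because "FAST NEUTRONS" and "NEUTRONS" both map to the single heading "NEUTRONS", the table cannot be run backwards to recover the original descriptors, which is one reason a table built for one direction does not serve for the other.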

Since, in general, convertibility of indexing vocabularies would be desirable
wherever duplication of cataloging and indexing effort is likely to occur (that is, where
two or more different documentation organizations receive at least some of the same
material as inputs to their systems), the results of these preliminary studies are
provocative and appear to merit the further study that is being sponsored by an Interagency
Task Group on Vocabulary Study of the Committee on Scientific Information, under the
Federal Council for Science and Technology.

There are many substantial difficulties, however. When applied to actual indexing
of the same items by the two agencies, it was found that for 277 items indexed by both
AEC and DDC (then ASTIA):

"ASTIA used a total of 2,571 descriptors, and AEC 840 subject headings ... of
these, 392, or roughly half of the AEC terms, were either completely or, for
all practical purpose, identical." 2/

Painter (1963 [460]) made further studies of equivalency in her investigations of
duplication and consistency of subject indexing at several Government agencies. For 200
items indexed by both AEC and DDC, she found 20% DDC equivalency, 67% AEC
equivalency, and 30% similarity of actual indexing. She concludes, in part:

"In considering these solutions and the statistics revealed by the studies it should 
be concluded that with a maximum of only 69 percent equivalency, or convertibility, 
and a minimum of 28 percent, there is still a large proportion of terms which will 



1/

Garfield, 1959 [221], p. 471.

2/

Hammond, 1962 [250], p. 4.



necessitate some other form of retrieval. This is the proportion which is involved 
with the problem of generics, where a term in one system subsumes two of another 
— and vice-versa. An additional problem evolves in attempting to reconcile two 
different subject concepts, one, the subject heading which usually has a single 
access point and one, the Uniterm or descriptor which has multiple access through 
coordination. Thus the practicality of a system made up of many units supplying 
information indexed differently, using as a basis for retrieval a table of equivalents, 
is questionable." 1/

Moreover, the results of tests of inter-indexer consistency rates within the same
agency were not encouraging. Thus Painter further concludes:

"The study, in combining the results of the equivalency analysis and the consistency
of indexing within each system and an equivalency of only 30 percent within the
broadest system, a table of equivalents is at present of little value in either a
manual or a machine system. In order to apply a table of equivalents efficiently,
both a high degree of consistency and a high degree of equivalency is essential." 2/

She therefore stresses that the possibilities for conversion by machine techniques
from one indexing set to an equivalent set for another vocabulary are adversely affected
by the generally poor rates of inter-indexer consistency. With reference both to the
Datatrol Studies 3/ and to corroborative findings of her own, she states:

"The value of equivalency studies and most particularly the table of equivalents
presuppose the consistency of indexing. Convertibility between systems is thus
dependent on the consistency of indexing. Without consistency, the vocabularies
as units are not sound; equivalencies cannot be drawn or effectively used for
convertibility." 4/
1/

Painter, 1963 [460], p. 104.

2/

Ibid, p. ix.

3/

Hammond, 1962 [250]; Hammond and Rosenborg, 1962 [252].

4/

Painter, 1963 [460], p. 109. Note that these estimates of inter-indexer
consistency may be quite optimistic, as discussed on pp. 157-160 of this report.
3. INDEXES GENERATED BY MACHINE--AUTOMATIC DERIVATIVE INDEXING



We have noted, in the earlier statement of the scope of this survey, a distinction
between "derivative" and "assignment" indexing. This distinction is related directly to
the question: "Is what can be done by machine properly termed 'abstracting', 'indexing',
or 'classifying'?" It relates also, as we have remarked, to a continuing controversy far
older than any question of the introduction of machine techniques: that between "word"
and "concept" indexing, between "uniterms" if selected directly from the text and
"descriptors" in the sense of their being indexing terms selected so as to have "a
carefully specified meaning for retrieval", 1/ to say nothing of contrasts with subject heading
schemes and classification schedules.

Some of the major arguments pro and con derivative (usually word) and assignment
(usually concept) indexing will be considered in a subsequent section of this report on the
problems of evaluating indexing methods. 2/ Nevertheless, the present popularity of
automatic derivative indexes of the KWIC type, while subject to all the disadvantages
typically cited for all purely derivative indexing systems, does show the actuality of
automatic indexing potentialities and may in fact hold the promise of solving some of the
present-day problems of subject control.

In this section, we shall consider first the straightforward word extraction
techniques used in KWIC type indexes. Possibilities for modified derivative indexing by
title augmentation, manipulation of word groups and use of special clues in keyword
selection are then discussed, including work by Baxendale, Luhn, and Artandi. Related
research and development efforts in automatic abstracting which lend themselves
to derivation of indexing terms include proposals and experiments by Luhn, Oswald,
Edmundson, Wyllys, Doyle, and Lesk and Storm, among others. Some comments will
be given on the quality of modified derivative indexing by machine. Automatic derivative
indexing at the time of search, as in the natural language text searching systems of
Swanson, Maron, Kuhns, and Ray, and Eldridge and Dennis, will be discussed in a later
section of this report.

3.1 KWIC Indexes

The development of computer-generated permuted-title keyword indexes, especially
in the issuances of Chemical Titles and B.A.S.I.C. (Biological Abstracts-Subjects-In-Context),
has been hailed by some as "the miracle of the decade" and "the greatest thing
to happen in chemistry since the invention of the test tube". 3/ The major reason for
the optimistic enthusiasm is the speed with which the computer can produce
a complete index to some specific set of books, documents or papers so that publication
and dissemination of the index can be prompt and thus serve as an important tool in


1/

Mooers, 1963 [423], p. 3.

2/

See pp. 132-136.

3/

Quoted by D. R. Baker, statement in "U. S. Congress, Senate Committee on
Government Operations", 1960 [619], p. 169.



maintenance of truly current awareness. For example, Herner in his 1961 review of the
state-of-the-art of organizing information says:

"I am told that the American Chemical Society has never had a more successful
basic science publication. The key to the whole thing is, I believe, the extreme
currency of Chemical Titles. This in turn derives from the speed and simplicity
of the KWIC process." 1/

Conrad reports as follows:

"Reception of B.A.S.I.C. ... has been so extremely enthusiastic ... that we
are excited by the possibilities of producing permuted title indexes in one or
more additional languages. The creation of a B.A.S.I.C. index in any language
requires only that the titles be translated and punched on cards. Alphabetical
arrangement, permutation and 'type-setting' is completely automated and, for
5,000 titles takes only two hours to accomplish." 2/

3.1.1 Applications of KWIC Indexing Techniques

The KWIC type process is indeed simple and straightforward. The words of the
author's title are prepared for input to the computer by keystroking, either to punched
cards or to punched paper tape. After being read by the computer, the text of a title is
normally processed against a "stop list" to eliminate from further processing the more
common words, such as "the", "and", prepositions, and the like, and words so general
as to be insignificant for indexing purposes, such as "demonstration", "typical",
"measurements", "steps", and the like. The remaining presumably "significant" or
"key" words are then, in effect, taken one at a time to an indexing position or window,
where they are sorted in alphabetical order. The result is a listing of each such word
together with its surrounding context, out to the limit of the line or lines permitted in a
given format. As each keyword is processed, the title itself is moved over so that the
next keyword occupies the indexing position, and this process is repeated until the entire
title has thus been cyclically permuted.
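
The process just described can be sketched in a short program. This is a minimal illustration under stated assumptions, not the program used for any published index: the stop list is a small sample, the line width and window position are arbitrary, the titles and reference codes are invented, and context that does not fit the line is simply truncated rather than wrapped around.

```python
# Sample stop list only; production lists were considerably longer.
STOP_LIST = {"the", "and", "of", "in", "on", "a", "an", "to", "for", "by", "with",
             "demonstration", "typical", "measurements", "steps"}

def kwic_entries(title, ref, width=60, window=30):
    """Return one (keyword, formatted line, reference) entry per significant
    word, with the keyword starting at column `window` of the line."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        key = word.strip(".,;:()").lower()
        if not key or key in STOP_LIST:
            continue
        # Context to the left is cut at the line start, to the right at the line end.
        left = " ".join(words[:i])[-(window - 1):]
        right = " ".join(words[i:])[: width - window]
        entries.append((key, f"{left:>{window - 1}} {right}", ref))
    return entries

def kwic_index(titles, **layout):
    """Build the full index: all entries for all titles, sorted by keyword."""
    entries = []
    for ref, title in titles.items():
        entries.extend(kwic_entries(title, ref, **layout))
    return sorted(entries)

# Hypothetical titles and reference codes, invented for this example only.
titles = {
    "A1": "Effect of Insulin on Glycemia",
    "A2": "The Glycemic Curve in Normal Sheep",
}
index = kwic_index(titles)
for key, line, ref in index:
    print(f"{line:<60} {ref}")
```

Each title thus yields as many index lines as it has significant words, which is why the size of the stop list has a direct effect on the bulk of the printed index.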

A number of formats are available in which the length of the line, the position of
the indexing window, and the extent of "wrap-around" (bringing the end of a title in at the
beginning of a line to fill space that would otherwise be left blank) are major variables.
Current examples of KWIC type indexing output are shown in Figures 2 through 7.
Usually, the indexing window is located at or near the center of the line with several
extra spaces to the immediate left or with other devices such as the shading of
B.A.S.I.C. to aid the searcher in scanning down the keywords listed. This is a



1/

Herner, 1962 [266], p. 10.

2/

Conrad, 1962 [137], p. 378A.
[Figure: sample page of machine-produced index output. Title entries with authors and journal citations are grouped under keyword headings running alphabetically from GLYCEMIC through GLYCOGEN. The entries are not legible in this copy.]
HbRNONE* * C A 6E SNOOT * ACtA PHYSIOL PHAftHACDL NEERL Vt 
P1D7-Z0# NAV 6P 

A HtCftOHETHOD FOR BIRULTaREDUS DETERNtNATtOH OF GLUCOSE ARO 
XETONE BODIES IN BLOOD ABB GLYCOGEN AND XEtORE BODIES IN 
LtvER* * D HANSEN * SCARB J CLIN LAB IRVElt V 12 Pll-24# 1*60 

an Inverse RtLAtlbN oETbeEr the lJvEr glycogen and the blood 
olucoSE In ThE PaT adapted To a fat diet. * p a hates * mature 

LOND V1B7 P31S-6# 23 JULY 60 

LIVEr OLUCOStL OL I GOt aCCHaN I OES AND GLYCOGEN CARIDN-14 
OtDXlOE EXPERIHENYI NtTH HYDROCORTISONE# * H G $1E# 

J ASHNbRE# R MAHLER# N H FIShHAR * NATURE LORD V1I4 P136I-1# 

31 ftCT S* 

STUDIES Or oltCoOEn BIOSYNTHESIS in guinea pig corhea it 
MEAN* OF OLUCOSe LABELED HlTH C14* * R PHAUS# 

J DIERIERGER# J VOTOCNOVA * CESK FTSiOL VO P4 S- f 6 * JAR 61 CZ 

OLTCOGER CONTENT aNO CAftlOHYftRtTE METABOLISM OF THE LEUKOCYTES 
tH DIABETES HELLITUS* * G NAEMfl * H1EN 1 INN HED V4D P330-4* 
SEPT SO OER 

GLTCOGEN LlvEft. AH IaTNOGEHIC ACUTE AIDDHIhaL DISORDER IN 
01ABETES HGLLITUI. * A SCHOYTE# H X LANKAMP# H FREhREL * 

RED T GEREEIR V IB 3 PZZSG-6Z# 7 HOV St OUT 

ACUTE BLYCOGER INFILTRATION OF THE LIVER IN DlAtETES HELLtTUS* 
■Z. THE EFFECTS OF GLUCAGON THERAPY. » A SCHOYTE# N H LANNANP# 
H FRENNLL * NED T GENEESX V104 P12BI-01* 2 JULY 60 OUT 



Figure 7. Sample Page, Diabetes Index

essentially the original Luhn format, and it should be noted in this connection that while
Luhn recognized that the origin of the KWIC principle lay in the making of concordances,
he claimed in particular the use of machines to achieve speed, completeness, accuracy,
and a novel format. 1/

The most common variant to the center position for the indexing window (or keyword
position) is at the left or the beginning of the line. Netherwood's selected bibliography of
logical machine design, which is probably the first of the modern permuted title indexes
to appear in the open literature, used the left-most positions for the index entry word in
each title listing. Slant marks were also printed to show the breaks in the normal order
of the title (Netherwood, 1958 [437]). A proposed subscription service, advertised in
1958 but never actually brought into operation, would also have used the left-hand
position. 2/

In these left position examples, the keyword-in-context principle is kept only
partially intact since the word in the index position is directly adjacent to its most
specific right-hand context, not to its left-hand. In variations such as developed at
Stanford Research Institute, however, the index word is extracted from its context and
printed separately in the left-hand margin, with the title in its normal order printed to
the right. This type of variation has been called "KWOC", for keyword-out-of-context,
and is illustrated in Figure 6, which shows the format developed by C.E.I.R., Inc. for
the OTS index to U. S. Government Research Reports.
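
The three layouts just described (center window, left window with slant marks, and
KWOC) can be sketched in a few lines of Python. This is only an illustration under stated
assumptions, not a reconstruction of any of the programs cited: the stop list, the function
names, the sample title, and the choice to truncate rather than wrap long titles are all
hypothetical.

```python
# Hypothetical stop list; an operational index would use a much longer one.
STOP_WORDS = {"a", "an", "and", "by", "for", "from", "in", "of", "on", "the", "to"}


def keyword_positions(title):
    """Yield (position, word) for every title word that is not on the stop list."""
    for i, word in enumerate(title.split()):
        if word.lower().strip(".,;:") not in STOP_WORDS:
            yield i, word


def kwic_center(title, width=60):
    """Center window: each keyword starts at the middle column of the line."""
    words = title.split()
    half = width // 2
    lines = []
    for i, word in keyword_positions(title):
        left = " ".join(words[:i])[-(half - 1):]   # context before the keyword
        right = " ".join(words[i:])[:half]         # keyword plus following context
        lines.append((word.lower(), left.rjust(half - 1) + " " + right))
    return [line for _, line in sorted(lines)]


def kwic_left(title):
    """Left window: rotate the title so the keyword leads; a slant mark
    shows where the normal word order was broken."""
    words = title.split()
    lines = []
    for i, word in keyword_positions(title):
        rotated = words[i:] + (["/"] + words[:i] if i else [])
        lines.append((word.lower(), " ".join(rotated)))
    return [line for _, line in sorted(lines)]


def kwoc(title):
    """KWOC: keyword pulled out to the margin, title kept in normal order."""
    return sorted((word.strip(".,;:").upper(), title)
                  for _, word in keyword_positions(title))


if __name__ == "__main__":
    sample = "Automatic indexing of titles"   # illustrative title only
    for line in kwic_center(sample):
        print(line)
    for line in kwic_left(sample):
        print(line)
    for keyword, full_title in kwoc(sample):
        print(f"{keyword:<12}{full_title}")
```

In the left-window output the keyword is followed directly by its right-hand context, while
its left-hand context appears only after the slant mark; the KWOC output leaves the title
intact at the cost of repeating it in full under every keyword.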

Table 1 lists a number of KWIC indexing projects for which computer programs are or
might be made available to interested additional users. Computer programs have been
written specifically for the IBM 650, 704, 1620, 709, 7090, and 7094 data processing
systems, the G.E. 225 computer, the Deuce Computer in England, the UNIVAC 1103 and
1107 systems, and the Japanese computer JEIPAC, among others. In addition, some
permuted title indexes are produced manually, or with the use of simple business office
machine equipment. For example, an index to the AIBS Bulletin for 1951-1961 has been
so produced by the American Institute of Biological Sciences.

1/ Private communication, excerpt of letter from H. P. Luhn to C. L. Bernier,
December 27, 1960: "With respect to the origin of the KWIC Index, you are,
of course, right that it is a form of concordance, as stated in my original
paper. Furthermore, keyword indexing has been practiced in various forms
as far back as a hundred years ago. All of these methods were, however,
dependent on manual effort. I would say that the significance of the present KWIC
Index is based on the fact that it is produced automatically by machine, affording
speed of compilation, accuracy and completeness. As far as the particular format
of the Index is concerned, this is novel to my knowledge, in accordance with
information I have been able to ascertain from others."

2/ "PILOT--a permutation index to this month's literature", see p. 8 and Figure 1.
A left-most window full-title format was developed at Stanford University in
cooperation with the IBM San Jose Laboratories. It has been applied by the
Computation Center to the titles of computer programs for the benefit of users of the
Program Library. Computation Center, Stanford University, "The KWIC Index",
1963. See also Marckworth, 1961 [393].

3/ National Science Foundation's CR&D Report No. 11, [430], p. 10; Janaske, 1962
[299]; Shilling, 1963 [550] and [551].

Table 1. KWIC Type Indexes and Programs



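Issuing organization and/or investigator: Service Bureau Corporation - H. P. Luhn
  Name of index or program: "Bibliography and Auto-Index, Literature on Information Retrieval and Machine Translation"
  When issued: First edition Sept. 1958; second edition June 1959
  Format: 2-column, 60-character single line title, center window
  Computer: IBM 709
  References and remarks: Basic Luhn KWIC

Issuing organization and/or investigator: Chemical Abstracts Service
  Name of index or program: Chemical Titles
  When issued: Semi-monthly
  Format: Standard Luhn IBM
  Computer: 1401
  References and remarks: -

Issuing organization and/or investigator: Chemical Abstracts Service
  Name of index or program: Chemical Biological Activities
  When issued: Bi-weekly, 1st issue Sept. 1962
  Format: Single column, center window, 120-character line, upper and lower case, 120-character 1403 printer
  Computer: 1401
  References and remarks: -

Issuing organization and/or investigator: Biological Abstracts
  Name of index or program: B.A.S.I.C.
  When issued: Semi-monthly
  Format: Standard Luhn IBM
  Computer: 1440
  References and remarks: Modified Luhn program: shading is used as an aid in scanning.

Issuing organization and/or investigator: Biological Abstracts
  Name of index or program: Biochemical Title Index
  When issued: Monthly
  Format: Luhn, Chem. Titles formats
  Computer: 1440
  References and remarks: -

Issuing organization and/or investigator: Bell Telephone Laboratories
  Name of index or program: Index to the Literature of Magnetism; BTL talks and papers
  When issued: Annually; annually
  Format: Single column, 120-character line, center window
  Computer: 7090
  References and remarks: BE-PIP Program available through the SHARE organization

Issuing organization and/or investigator: All-Union Inst. for Scientific and Technical Information
  Name of index or program: -
  When issued: -
  Format: "... an index of the 'Chemical Titles' type."
  Computer: -
  References and remarks: Mikhailov, 1962 [418]

Issuing organization and/or investigator: American Bar Foundation, Bobbs-Merrill
  Name of index or program: Index to Current State Legislation
  When issued: Initial issue, 1963
  Format: -
  Computer: -
  References and remarks: Eldridge and Dennis, 1962 [183]

Issuing organization and/or investigator: American Diabetes Association
  Name of index or program: Diabetes-related Literature Index
  When issued: First of proposed series, covering literature for 1960, issued 1963
  Format: 2-column, left window, KWOC, full citation for each entry
  Computer: GE-225, Western Reserve program
  References and remarks: -

Issuing organization and/or investigator: American Meteorological Society
  Name of index or program: Meteorological and Geoastrophysical Titles
  When issued: April 1961, Oct. 1961, Jan. 1962 and following
  Format: Standard Luhn IBM
  Computer: 704
  References and remarks: Includes a Systematic UDC-Subject Heading Index as well as modified KWIC.

Issuing organization and/or investigator: Armour Research Foundation
  Name of index or program: Key words in context (reports received in document library)
  When issued: -
  Format: Two column, 60-character line, center window
  Computer: 1103, 1107
  References and remarks: -

Issuing organization and/or investigator: ASTIA (Defense Documentation Center)
  Name of index or program: Keywords-in-context title index. A list of titles for ASTIA documents not previously announced.
  When issued: Irregularly; No. 1, Oct. 1962; No. 2, Feb. 1963
  Format: -
  Computer: IBM
  References and remarks: -

Issuing organization and/or investigator: English Electric Company
  Name of index or program: KWIC-type
  When issued: -
  Format: -
  Computer: Deuce
  References and remarks: See Black, 1962 [65]; Dowell and Marshall, 1962 [159].

Issuing organization and/or investigator: General Electric Computer Dept., Phoenix
  Name of index or program: General Bibliography on Information Storage and Retrieval
  When issued: -
  Format: Single column, center window
  Computer: GE-225
  References and remarks: -

Issuing organization and/or investigator: Gmelin Institute
  Name of index or program: Information Journal for Atomic Energy
  When issued: -
  Format: -
  Computer: -
  References and remarks: See Koelewijn, 1962 [330].

Issuing organization and/or investigator: Japan Information Center of Science and Technology
  Name of index or program: -
  When issued: -
  Format: -
  Computer: JEIPAC
  References and remarks: "The JEIPAC, a transistorized information processing machine ... has also been programmed for automatic indexing designed after the IBM KWIC indexing system." CR&D No. 11, [430], p. 120-121.

Issuing organization and/or investigator: Lockheed Missiles and Space Division
  Name of index or program: KWIC Index of Reports
  When issued: -
  Format: Modified Bell Labs.
  Computer: 1401/7090
  References and remarks: See Carroll and Summit, 1962 [102].

Issuing organization and/or investigator: Mimosa Frenk Foundation for Applied Neurochemistry
  Name of index or program: KWIC Index to Neurochemistry
  When issued: August 1961
  Format: Standard Luhn IBM
  Computer: IBM
  References and remarks: -

Issuing organization and/or investigator: M.I.T.
  Name of index or program: KWIC Index to The Science Abstracts of China
  When issued: 1st Edition, December 1960
  Format: Standard Luhn IBM
  Computer: 704
  References and remarks: -

Issuing organization and/or investigator: National Bureau of Standards
  Name of index or program: A Bibliography of Foreign Developments in Machine Translation and Information Processing
  When issued: July 1963
  Format: Single column, 120-character line, center window
  Computer: 7090
  References and remarks: Byproduct input from Flexowriter tape, citation data including upper and lower case, paper tape to punched card conversion. Walkowicz, 1963 [629].

Issuing organization and/or investigator: National Bureau of Standards, W. W. Youden
  Name of index or program: Index to the Communications of the ACM; Index to The Journal of the ACM
  When issued: -
  Format: Single column, 120-character line, center window
  Computer: 7090
  References and remarks: Youden, 1963 [659] and [660].

Issuing organization and/or investigator: Radio Corporation of America
  Name of index or program: Significant Words Indexed From Title
  When issued: -
  Format: -
  Computer: RCA 301
  References and remarks: Unpublished report by D. Climenson and M. Bechman

Issuing organization and/or investigator: Stanford Univ., IBM San Jose Labs.
  Name of index or program: Dissertations in Physics
  When issued: 1961
  Format: Keyword-out-of-context, left window
  Computer: IBM
  References and remarks: Marckworth, 1961 [393].

Issuing organization and/or investigator: Union Carbide Oak Ridge National Laboratory Libraries
  Name of index or program: Key Word Index, Laboratory Reports Received, Semi-annual Index January-June 1963
  When issued: 1st issue 1963, monthly thereafter
  Format: Bell Labs. System
  Computer: -
  References and remarks: -

Issuing organization and/or investigator: U. S. Atomic Energy Commission, Division of Technical Information
  Name of index or program: Index to Conferences Abstracted in Nuclear Science Abstracts
  When issued: December 1963
  Format: Bell Labs. System
  Computer: -
  References and remarks: -

Issuing organization and/or investigator: University of California, Lawrence Radiation Laboratories
  Name of index or program: Key-word-in-title (KWIT) index for reports
  When issued: Various issues
  Format: Single column, 120-character line, center window
  Computer: 1401/7090
  References and remarks: Records can be machine searched with and, or and not logic

Issuing organization and/or investigator: University of California, Lawrence Radiation Laboratories
  Name of index or program: Unclassified Reports Titles List
  When issued: Biweekly
  Format: Single column, 120-character line, center window
  Computer: 1401
  References and remarks: By-product preparation from Flexowriter of library cards. Turner and Kennedy, 1961 [614].

Issuing organization and/or investigator: University of Kansas
  Name of index or program: Kansas Slavic Index
  When issued: Initial issue July 1963
  Format: 60-character, Modified Chemical Titles
  Computer: 1401
  References and remarks: Farley, 1963 [192].

Issuing organization and/or investigator: University of Kansas, University of Oklahoma
  Name of index or program: (Space Law collection)
  When issued: -
  Format: -
  Computer: 1401
  References and remarks: "Current research and development..." No. 11, p. 44 & 171.

Issuing organization and/or investigator: Western Periodicals Company
  Name of index or program: Permuted Indexes to Scientific Symposia
  When issued: As available
  Format: Standard Luhn IBM
  Computer: -
  References and remarks: Advertised regularly in various periodicals, e.g., Special Libraries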

In addition to the regularly issued KWIC indexes by Biological Abstracts, Chemical
Abstracts Service, the American Meteorological Society and others, a large number of
special field, one time, or limited collection coverage indexes of this type have been and
are being produced both in the United States and in other countries. Well-known examples
include the programs developed at the Lawrence Radiation Laboratories, University of
California, which simultaneously produce catalog, cross-reference and subject authority
cards, 1/ and the programs developed at the Bell Telephone Laboratories from 1959 onward
(Kennedy, 1962 [310]).



Other KWIC indexing efforts cover a wide variety of subject matter. In the field
of law, applications of KWIC type indexing include work on the legislation of the 50 states,
a joint project of the American Bar Foundation and the Bobbs-Merrill Company (Eldridge
and Dennis, 1962 [183], 1963 [182]); the ninth annual edition of the Index to Legal
Theses and Research Projects, July 1962 (Eldridge and Dennis, 1963 [182]); and a
cooperative program between the libraries of the Universities of Kansas and Oklahoma to
prepare an index to the latter's "Space Law" collection. 2/ In 1960, the KWIC Index to
the Science Abstracts of China was prepared for an AAAS Symposium (Henderson, 1961
[263]; Farley, 1963 [192]). At the University of Kansas Library also, the Kansas
Slavic Index is being produced, with coverage of 3,000 articles from more than 200 Slavic
journals. 3/ In the computer technology field, Youden (1963 [659] and [660]) has
compiled KWIC type indexes to both the Journal of the ACM and the Communications of the
ACM, and the Western Periodicals Company offers KWIC indexes to the proceedings of
the Joint Computer Conferences as well as to the proceedings of other conferences and
symposia, including those in the fields of electronics, aerospace and quality control. 4/ A
special-purpose application is the use of a KWIC index in lieu of cross-references in
a revised edition of Current Medical Terminology. 5/



Examples of KWIC indexing projects abroad include work at the Japanese Information
Center of Science and Technology, Tokyo, 6/ an index "of the 'Chemical Titles' type"
at the All-Union Institute for Scientific and Technical Information (VINITI), U.S.S.R., 7/
an information journal for the atomic energy field being prepared at the Gmelin
Institute (Koelewijn, 1962 [330]), and work in Great Britain both at the English Electric
Company 8/ and the IBM British Laboratories (Black, 1962 [65]).



1/ National Science Foundation's CR&D Report No. 11, [430], p. 42.

2/ Ibid., pp. 44 and 171.

3/ Ibid., p. 43; University of Kansas, 1963 [307].

4/ See advertisements in journals such as American Documentation.

5/ Gordon and Slowinski, 1963 [236], p. 55.

6/ National Science Foundation's CR&D Report No. 11, [430], p. 120.

7/ Mikhailov, 1962 [418], p. 50.

8/ Dowell and Marshall, 1962 [159], p. 323; Black, 1962 [65], p. 316.

Trans-Canada Air Lines 1/ is using a KWIC System, and at the EURATOM ISPRA
laboratories a KWIC type program has been developed with up to 600-character context and a
left-most indexing position. 2/

3.1.2 Advantages, Disadvantages and Operational Problems of KWIC Indexing

Luhn's original acronym, KWIC, is peculiarly apt for permuted title word indexing. 
As both proponents and critics have noted, the resulting product may be relatively crude 
in terms of indexing quality, but it is quick. The speed achievable both by elimination of 
human intellectual effort and by use of machine (especially computer) processing is indeed 
the major single advantage of this type of automatic indexing. Closely related, however, 
are the advantages of currency of announcement and the availability of these indexes for 
individual use. 



Some typical claims with respect to speed and currency are as follows:

"The permuted index was invented as a means of adequately controlling
(essentially, of indexing) the literature without further intellectual effort,
and thus eliminating indexing delays." 3/

"The great merit of this particular method ... is that it enables information
concerning new articles to be made available very much more quickly than if
there were the inevitable delays of human abstracting and indexing." 4/

"In spite of the disadvantages which are pointed out, perhaps the greatest
advantage is the timeliness and the speed with which permuted-title indexes
can be prepared." 5/



Specific examples of high speed are given by Biological Abstracts, where one hour's
computer time suffices to prepare and arrange entries for over 150,000 items. 6/
Kennedy reports for the Bell Laboratories System that:

"Editorial scanning is very fast; only several lines of print must be read for
each report and the required text markings are trivially few. Keypunching,
the largest single task, takes about two minutes per report ... Main-frame time
... was 12 minutes for 1703 reports." 7/



1/ Simons, 1963 [556], p. 34.

2/ Meyer-Uhlenried and Lustig, 1963 [417], p. 229.

3/ Tukey, 1962 [611], p. 13.

4/ Cleverdon, 1961 [125], p. 108.

5/ Janaske, 1962 [299], p. 3.

6/ See Biological Abstracts, 36:24, p. xii.

7/ Kennedy, 1961 [311], p. 123.

Skaggs and Spangler claim:

"The most obvious advantage of permuted indexing by computer is speed. In
a test of one permuted indexing system, input of 3,000 punched cards containing
titles and running text produced a permuted significant word index of
12,190 index entry lines, with approximately 85 minutes of computer time
required for the permuting and sort operations. The output was printed at
some 500 lines per minute ..." 1/
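
The rates implied by the Skaggs and Spangler figures can be worked out directly. The
short calculation below uses only the numbers quoted above; the variable names are
illustrative, and since the source gives the computer time and print speed as approximate,
the derived rates are approximate as well.

```python
cards = 3_000            # punched cards of input (titles and running text)
entry_lines = 12_190     # index entry lines produced
compute_minutes = 85     # approximate time for permuting and sorting
print_rate = 500         # approximate lines printed per minute

entries_per_card = entry_lines / cards
entries_per_compute_minute = entry_lines / compute_minutes
print_minutes = entry_lines / print_rate

print(f"{entries_per_card:.2f} entry lines per input card")
print(f"{entries_per_compute_minute:.0f} entry lines per minute of computer time")
print(f"{print_minutes:.1f} minutes to print the index")
```

On these figures each input card yields about four index entry lines, the permuting and
sorting proceed at roughly 143 entry lines per minute, and printing the finished index
takes a little over 24 minutes.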



In many cases, greater speed and timeliness are achieved at significantly lower
cost. This is particularly true if the preparation of the input--title, author, item
identification and other descriptive cataloging information--serves multiple purposes
from a single keystroking operation. Thus, the MATICO System provides from a single
input (1) KWIC indexes as required, (2) selective dissemination notices to potential users
of new acquisitions, (3) records on magnetic tape for the information retrieval file, and
(4) book catalogs covering specialized areas of the collection, all at a net savings over
previous methods of $0.39 for each title processed. 2/

Another advantage which is typically claimed for KWIC indexes is the use of the
author's own terminology. The display of different words as they have been used in
title context with any word looked up introduces "suggestiveness" so that different
meanings and different browsing clues are shown. Kennedy makes the following typical points:

"The use of the author's own terms--the alive currency of new ideas--rather than
the considered reshapings to the indexing system may often be of advantage. The
automatic generation as index entries of all the separate words in multi-term
concepts is definitely so. Access is direct, under any one of the component
terms, in the unrestricted manner of Uniterm indexing. And context minimizes
false drops; the author has supplied the term coordination." 3/

Others, however, consider some of these same factors to be definite disadvantages. 



In general, even among enthusiasts of KWIC, there is more agreement as to the
values of the technique as a device for current awareness scanning and as a dissemination
index than for its use for more extensive searching. It was, in fact, primarily as a
dissemination index that Luhn first proposed the KWIC technique. He pointed out that
such indexes could be prepared with minimum effort and be ready for dissemination in
the shortest possible time, justifying publication by inexpensive printing means. He also
noted the following additional advantages:

1/ Skaggs and Spangler, 1963 [557], p. 30.

2/ Carroll and Summit, 1962 [102], p. 4.

3/ Kennedy, 1962 [310], p. 184.

"1. Because of the mechanical method of preparation, more information
may be displayed than would have been practicable by conventional
means.

"2. Keywords-in-context permit the cross-correlation of subjects to an
extent not realizable by conventional procedures." 1/

The most common type of complaint against the KWIC indexing method is, as we
have noted earlier, identical with that which is applied to word indexing in general--the
lack of terminological control. Where the indexing terms are restricted to those used by
the author himself, in his title or even full text, there arise many serious problems of
synonyms, near-synonyms, homographs, neologisms, and eponyms. The effects of
machine inability to resolve these problems are redundancy, scatter of references
throughout the index, "haphazard groupings", 2/ and retrieval losses because the user is
forced to guess at the terminology the author actually used. These problems are
severely aggravated when only the title is used as the basis for index-word extraction.

Thus, a first and major question in attempting to appraise the effectiveness of KWIC
indexing techniques is that of the adequacy of titles alone as the source of subject content
clues. Spurred on at least in part by the existence of KWIC-type indexes, several
investigators have studied this question, with somewhat different results. Williams has
explored for some years the possibilities of developing systematic procedures for title
elaboration, especially making explicit information that is implied. Her conclusions are
that indexing by title and direct elaboration of the title would produce index information
equivalent to that found in Chemical Abstracts for about 50 percent of the documents
studied, but that other procedures would be required for the remainder. 4/

Specific studies of title adequacy for a particular journal or field have been
undertaken by both the American Institute of Physics and the Biological Sciences
Communications Project. In the A.I.P. experiments, graduate physics students were asked to
locate from limited clues certain specific articles appearing in The Physical Review, and
search times were checked for their use of permuted title and other indexes. Another
group of students compared the subject index entries in Physics Abstracts and Chemical
Abstracts with the words in the titles of 25 papers from The Physical Review. In the case
of Physics Abstracts, 69 percent of the entries for these papers were found in the words
of the title and 63 percent of the titles contained all of the information supplied by the
set of index entries. In the case of Chemical Abstracts, the corresponding percentages
were 47 and 23. 5/ These latter findings, for the chemical index, are closely corroborated

1/ Luhn, 1959 [381], p. 295.

2/ Olney, 1963 [458], p. 44.

3/ See, for example, Dowell and Marshall, 1962 [159], p. 324: "This problem of
'conceptual scatter' becomes a nightmare when highly idiosyncratic author
language is used as a basis for subject indexing."

4/ Williams, 1961 [643], pp. 361-363.

5/ Maizell, 1960 [392], p. 126.

by Bernier and Crane, who report that for the non-organic chemistry items covered by
Chemical Abstracts, 34 percent of the entries can be derived from the titles. 1/

With respect to the Biological Sciences Communications Project studies, Shilling
reports as follows:

"Titles of scientific articles are being utilized at present in a great many ways
under the general assumption that there is a positive correlation between the
title and the content of the article. A study was undertaken to analyze the
accuracy of titles in describing the content of biomedical articles. It was conducted
in two parts. In part one, a group of scientists were asked to predict the content
of selected scientific articles, in their area of interest, from the title, the author's
name, and the name of the journal in which it appeared. The results of the first
phase of the study on the first trial journal were so diverse as to make analysis
impossible, and this part of the study was not pursued further. From this small
segment of the study it appears that scientists are deluding themselves when they
search by title only and then decide what they wish to read.

"In the other half of this experiment, the article without title, author's name, or
journal name was sent to 20 scientists, selected as experts in the scientific field
of the article, who were asked to write a meaningful title. Fifty articles were
used, five from each of ten selected biomedical journals. From this part of the
study it is apparent that if the article is in a field which is relatively well
standardized and has an accepted vocabulary, it is possible for a group of titlists to
agree remarkably well on an appropriate title. However, if the article is loosely
organized, contains more than one subject, or is in a specialty in which there is
no standard vocabulary, then titling scientists fail to agree to a rather alarming
extent." 2/

Other studies involving the question of usefulness of titles alone for indexing purposes
include those of Doyle, Lane, Montgomery and Swanson, O'Connor, Ruhl, Swanson, and
White and Walsh, among others. Doyle checked the retrieval loss likely to result from
the synonymity-scatter problem for a permuted title index compiled in 1958 to the internal
reports of the System Development Corporation. He found, for example, that for 12
direct references to McGuire Air Force Base, there were one to "New York Air Defense
Sector", two to "New York Sector", ten to "NYADS" and five to "N. Y. Sector". 3/

1/ Bernier and Crane, 1962 [56], p. 120.

2/ Shilling, 1963 [551], pp. 205-206.

3/ Doyle, 1961 [166], p.

Ruhl (1963 [506]) found that between 50 and 90 percent of author-prepared titles (the
variation depending on subject field and other circumstances) did fully reflect the index
terms assigned to these documents by human indexers. Lane and White and Walsh have
also made studies directly related to the question of KWIC index effectiveness. The latter
two investigators report only 52 percent retrieval effectiveness for a permuted title index
to the Abstracts of Computer Literature, 1962, which they attribute to the changing
terminology in the still new field of computer technology. 1/ Lane made counts of titles
that would be "acceptable" and those that would not for a KWIC index for 50 titles drawn
from each of 10 published indexes. He concluded that, if there were judicious pre-editing,
technical articles in the technical subject indexes could be quite adequately covered, and
papers in the fields of law, business, and the humanities somewhat less satisfactorily so,
but that for the material indexed in the Reader's Guide to Periodical Literature, the KWIC
technique would fail 58 percent of the time. 2/

Montgomery and Swanson have studied, as has O'Connor in even more detail, the
adequacy of "machine-like indexing by people". Montgomery and Swanson took as their
test corpus the September 1960 issue of Index Medicus and found that for 4,770 items,
85.8 percent contained either the word itself or a synonym for the subject heading
assigned, slightly over 11 percent did not, and in the remaining cases the investigators
could not clearly decide. They concluded, therefore, that: "Most of the articles studied
could have been indexed by machine on the basis of machine 'inspection' of article titles
alone." 3/ O'Connor, however, typically reports that of a random sample of 50 papers
manually indexed under the term "Toxicity", five had titles which contained the word
"toxic" or the word "toxicity" and 34 had titles which were not even indirectly connected
with the term ([443], [444], [445], [447] and [448]). With respect to the Montgomery-Swanson
conclusions as such, Carlson raises the further critical questions of over-assignment
and false drops and suggests that: "a simple machine processing of titles
would give us way too much or practically nothing." 4/
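
The kind of tally reported in these studies (does the title contain the assigned subject
heading, or a synonym for it?) can be illustrated with a small Python sketch. It is not
the procedure of any of the investigators cited: the synonym table, the function names,
and the three sample titles are all hypothetical, chosen only to show how the result
depends on the synonym dictionary supplied to the machine.

```python
import re

# Hypothetical synonym table: subject heading -> title words accepted as equivalent.
SYNONYMS = {
    "toxicity": {"toxicity", "toxic", "poisoning"},
    "diabetes": {"diabetes", "diabetic"},
}


def title_words(title):
    """Lower-cased words of a title, with punctuation removed."""
    return set(re.findall(r"[a-z]+", title.lower()))


def headings_from_title(title, synonyms=SYNONYMS):
    """Headings whose own word, or a listed synonym, occurs in the title."""
    words = title_words(title)
    return sorted(h for h, accepted in synonyms.items() if words & accepted)


def coverage(sample, synonyms=SYNONYMS):
    """Fraction of (title, assigned heading) pairs recoverable from the title alone."""
    hits = sum(heading in headings_from_title(title, synonyms)
               for title, heading in sample)
    return hits / len(sample)


if __name__ == "__main__":
    # Invented titles with invented manual assignments, for illustration only.
    sample = [
        ("Toxic effects of chlorinated hydrocarbons in sheep", "toxicity"),
        ("Liver glycogen in diabetic rats", "diabetes"),
        ("Blood sugar levels after a single dose of growth hormone", "toxicity"),
    ]
    print(f"{coverage(sample):.0%} of assigned headings found in titles")
```

The third sample title carries no clue to its assigned heading, which is the situation
O'Connor reports for most of the "Toxicity" papers; enlarging the synonym table raises
coverage but also raises the risk of over-assignment that Carlson points to.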

Research activities at the American Bar Foundation have included checking of 
KWIC type indexing of several thousand legal articles with the subject headings assigned 
under the "Index to Legal Periodicals" system (Kraft, 1962 [333]). It is reported that: 



1/ White and Walsh, 1963 [639], p. 346.

2/ Lane, 1964 [345], p. 46.

3/ Montgomery and Swanson, 1962 [421], p. 359. In another study (1962 [534], p. 468),
Swanson reports findings for several thousand entries in classified bibliographies
where approximately 90 percent of the sampled items contained title words that were
identical, or similar in meaning, to the subject headings under which they were
indexed. He notes, however, that similar results could have been produced by
machine processing with the significant proviso that the machine have available an
adequate synonym dictionary or thesaurus.

4/ G. Carlson, 1963 [100], pp. 328-329.

"Interpretation of data revealed, among other things, that 64.4 percent of the
title entries contained as keywords one or more of the ILP subject heading words
under which they were indexed, and 25.1 percent contained logical equivalents.
The remaining 10.5 percent of the title entries had non-descriptive titles." 1/



The difficulties with titles as sources of the indexing information stem from at least
three distinct types of determining factors: (1) the language habits, background,
interests, and idiosyncrasies of the author; (2) the interests, familiarity with the subject
matter, language habits, imagination, and idiosyncrasies of the user; and (3) factors
largely extrinsic to either the particular author or the particular user. In the first case,
we find especially the problem of the witty, punning, deliberately non-informative title,
the so-called "pathological title". Janaske gives the provocative example, in the literature
of information selection and retrieval itself, of "The Golden Retriever". 2/ Even in the
non-pathological case, however, there is the serious question of whether the author
himself is likely to be a good indexer. 3/



On the user side, the normal critical problems of "bringing the vocabulary of
indexer and searcher into coincidence" (Bernier, 1953 [55]) are aggravated by the facts
that the user of KWIC must anticipate the terminology used by a large number of
different "indexers" (i.e., the authors), that title words spelled the same but with quite
different meanings in different special applications are grouped together in the same
place in the index, and that the same concepts may be expressed in quite different
phraseology depending on the author's, rather than the user's, field of specialization. To
these aggravating circumstances there must be added in turn the psychological
acceptability to the individual user of the scatter and redundancy, to say nothing of the
format and legibility, of a particular published index.



Such factors affecting the particular user will of course vary with the nature and
purpose of his search. Kennedy points out, for example, that the location of a document from
only a single clue, a single title word, is particularly easy with a permuted title index
and he emphasizes that the "index purpose, use, size, statement and array are other
factors of considerable moment in judging the value of title indexes". 4/

1/ National Science Foundation's CR&D Report No. 11, [430], p. 62.

2/ Janaske, 1962 [299], p. 4.

3/ See, for example, a report on a conference on better indexes for technical literature,
ASLIB Proceedings, 13:4, April 1961, with a number of statements on the author as
a poor indexer. See also Crane and Bernier, 1958 [144], p. 515: "Not even authors
are qualified to index their own work unless they are equipped for the task by
training and experience".

4/ Kennedy, 1961 [311], p. 125.

A major question in the area of user acceptability, however, is that of the adequacy
of title alone to tell the searcher whether or not a specific document is relevant to his
query or interest. A number of investigators, both documentalists and user-scientists,
suggest that this is rarely the case. 1/ In fact, for many users, titles alone provide only
a negative searching device--in an announcement bulletin or abstract journal the user's
scanning of titles merely tells him whether or not he should read the abstract and then
perhaps go on to the paper itself.

It is for reasons of this type, in all probability, that Montgomery and Swanson found
less effectiveness of titles on relevance-judgment tests than might be suggested by their
more optimistic findings as to the success of machine procedures for replicating human
subject heading assignments. Whereas they have claimed that about 90 percent of test
items could have been as successfully indexed by machine as by manual procedures
(Montgomery and Swanson, 1962 [421]; Swanson, 1962 [584]), they have also reported
that: "Comparison of title relevance judgment with judgment based on full text
examination indicates that titles are only about one-third effective (i.e., two-thirds of the
relevant articles would be judged irrelevant) as the basis for estimating the relevance of
the article to a given question". 2/ They go on to suggest, therefore, that "...indexing should
be based on more than titles and ... a bibliographic citation system should present to the
requester something more than titles." 3/ Similarly, Jahoda reports in an analysis of 281
actual search requests at Esso Research and Engineering that only two-thirds could have
been answered with a shallow index based on titles and major section headings of the
documents and that answering the remainder of the requests would have required an index
of considerable depth. 4/

The obvious factors affecting the utility of titles as the source of indexing-searching
clues include, first, the limitation of most titles to the principal subject matter, the main 
topic or topics of the document. The display of title context does to some extent provide 
for modifications of the topic to the special aspects treated, but it is of course obvious 
that a title cannot possibly provide clues to subject content not implied in the words of that 
title. In many cases, the potential user wants information contained in the paper, or even 



1/ See, for example, Atherton and Yovich, reporting on evaluations by physicists of
experimental citation indexing, 1962 [26], p. 22: "The reliance on titles of papers
for retrieval purposes was not sufficient"; Levery, 1963 [359], p. 235: "Titles are
usually insufficient to furnish a correct index to the text"; Hocken, 1962 [274], p. 93:
"The titles were not explicit enough"; Crane and Bernier, 1959 [145], p. 1053:
"Lists of titles can be prepared rapidly, but they are inadequately useful in selecting
articles of interest, and they provide little or no directly usable information";
Dowell and Marshall, 1962 [159], p. 324: "Frequently titles either lack sufficient
detail or are in fact misleading"; Connolly, 1963 [136], p. 35: "Most titles are
inadequate as descriptions of the contents of papers."

2/ Montgomery and Swanson, 1962 [421], p. 364.

in its appendices, which was not the principal concern of the author and may not even have
been considered significant by him. The claim that the author, who knows his own subject
best, has already indexed his paper best by his choice of words and emphasis in text, and
especially in his title, is pertinent only to that main subject to which he addresses himself,
not to the other potentially useful information which he may also disclose.

Other extrinsic factors affecting title adequacy and hence the effectiveness of title
indexes are the size and the relative homogeneity or heterogeneity of the collection or set
of documents so indexed, the breadth or narrowness of the subject field or fields covered,
the time period covered and whether for one or many fields. Whether or not material in
more than one language is included is a special factor. These various factors interact in
various ways, usually with disadvantageous effects when even the most "nondescript"
human indexer (that is, one who accepts only words from the text itself) is replaced by
"a keypunch operator whose job it is to convert the keywords into machine-readable form,
and a machine whose job it is to assimilate machine-readable text and print out its
permutations with each significant word serving as an access point." 1/

The difficulties of subject scatter, synonymy, homography, redundancy, and the
like, however, will also occur in human indexing that relies heavily on title only, which
is perhaps more frequently the case than is generally recognized, 2/ just as much as for
machine-generated indexes involving the permutations of keywords in titles. Such
disadvantages must therefore be balanced not only against the advantages of speed, timeliness,
having an index announcement tool personally available at low cost, and the like, but also
against the probability of obtaining as useful a tool within the limits of available human
indexing resources and justifiable costs. Cleverdon, for example, comments as follows:

"There are those who would say that this [KWIC] can in no way be called indexing, 
and that the value of such indexing must be very much lower than that done by 
intelligent trained human beings. This is a comfortable thought, but such small 
evidence as is at present available makes it appear doubtful as to whether it is 
entirely true. This is not to say that a human being cannot do a better job, but it 
certainly appears likely that the cost of employing a human being to do it is of 
doubtful economic value. " 3/ 



1/ Herner, 1962 [266], p. 4.

2/ See, for example, Moss, 1962 [425], p. 39: "I am convinced that a great many of
the UDC and other numbers which are provided on millions of cards in technical
libraries up and down the country, and which look so erudite, are, in fact, no more
than cards transliterating titles, with occasionally similar transliteration of a few
randomly chosen words from the abstracts as well. ... We are, in effect, already
largely using title indexing and complicating it unnecessarily by magic numbers."
See also Crane and Bernier, 1958 [144], p. 514: "Some indexes to periodicals,
particularly word indexes, are merely indexes of titles of papers or of abstracts."

3/ Cleverdon, 1961 [125], pp. 107-108.



It is also of interest to note, moreover, that the very existence of machine-generated
permuted title indexes should greatly increase the likelihood that authors will use better
and more useful titles. 1/ At a seminar on word and vocabulary byproducts of permuted
title indexing held at Biological Abstracts headquarters on October 8, 1962, Rigby of
Meteorological and Geoastrophysical Abstracts reported informally that as of that time
there was already discernible improvement in titles covered by their KWIC index. In the
same year (1962), Tukey similarly stated that: "Chemical Titles has been heavily enough
used to affect the construction of titles of papers on chemical subjects." 2/ Instructions
to authors of the previously mentioned "Short Papers" 3/ for the A.D.I. 1963 Annual
Meeting specified that at least six significant words should be included in their titles and
nearly all authors did in fact comply. Two of the "Short Papers" are specifically directed
to the topic of improvements that authors can make in writing their titles (Brandenberg,
1963 [80]; Kennedy, 1963 [312]).

Instructions of this type can be effectively used for situations where all authors are 
under the same administrative control, as in the internal reports prepared in a single 
organization. This type of situation, incidentally, is one for which KWIC proponents are 
often most enthusiastic (Kennedy, 1962 [310]; Black, 1962 [65]; Linder, 1960 [362]).

Finally, there is considerable promise that pressures brought to bear by journal editors 
of the publications of professional societies, notably the American Institute of Chemical 
Engineers and other cooperating member societies of the Engineers Joint Council, will 
result in improved adequacy of titles and thereby increased effectiveness of title word 
indexes. 

Certain other disadvantages of KWIC indexing techniques, however, relate specifically
to operational problems and requirements in the machine production of these indexes.
There is, first, the problem of the amount of context that is usually displayed--that is, the
question of line length--and the related problems of title truncation and wrap-around. As
Kennedy notes: "Progressive shifting of the title to bring a given word to the indexing
column frequently causes portions of the title to exceed the line space available, first at
the right margin, then the left, or even both simultaneously." 4/ A case in point is the
perhaps apocryphal "EROTIC TENDENCIES AMONG TRAPPIST MONKS" where
"ATHEROSCL" had been dropped off at the left.

For multi-column KWIC indexes, in particular, where the line length is typically
58-60 characters, "much of the relevance is lost because the reader sees the wrong slice
of the title". 5/ The Bell Laboratories KWIC index, 6/ Chemical-Biological Activities, 7/

1/ See, for example, Black, 1962 [65], p. 317; Youden, 1963 [658], p. 332.

and Youden's indexes to ACM papers (1963 [659] and [660]) illustrate single-column
formats that alleviate this problem by extending the title line to 103-106 characters,
exclusive of the identification code. Youden has calculated that for the titles in the field of
computer literature which he analyzed 30 percent of the titles would have been truncated
in 60-character title line formats, but that only 2 percent would have been chopped by
103-character title length limits. 1/
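
The truncation effect described above can be illustrated with a minimal sketch. It is not taken from any of the programs cited; the function names `kwic_line` and `truncated_fraction`, the parameters `width` and `key_col`, and the rule that a title counts as truncated whenever it is longer than the title line are all assumptions made for illustration, and wrap-around is deliberately omitted.

```python
def kwic_line(title, key_index, width=60, key_col=25):
    """Return one fixed-width KWIC line (hypothetical layout).

    The word at position key_index is shifted so that it starts at
    column key_col (0-based); context that does not fit on either
    side is simply cut off, with no wrap-around.
    """
    if key_col < 1:
        raise ValueError("key_col must leave room for at least a blank")
    words = title.split()
    left = " ".join(words[:key_index])
    right = " ".join(words[key_index:])
    left_width = key_col - 1                  # one blank precedes the keyword
    left = left[-left_width:] if left_width > 0 else ""
    right = right[: width - key_col]          # cut at the right margin
    return left.rjust(left_width) + " " + right


def truncated_fraction(titles, width):
    """Fraction of titles longer than the title line (assumed criterion)."""
    if not titles:
        return 0.0
    return sum(len(t) > width for t in titles) / len(titles)


title = "ATHEROSCLEROTIC TENDENCIES AMONG TRAPPIST MONKS"
# With only six columns of left context, the first word loses "ATHEROSCL".
print(kwic_line(title, key_index=1, width=40, key_col=7))
```

Applying `truncated_fraction` to the same set of titles at two line widths (for example 60 and 103) gives the kind of comparison Youden reports, although his actual counting rule is not stated here.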

A second disadvantageous effect of machine production requirements in most KWIC
indexes is the tedious sequential scanning necessary because of the unbroken organization
of the page format and the long blocks that occur for frequently occurring word entries.
Doyle (1959 [168], 1961 [166]) has investigated this problem of block length and suggests
either that alphabetization be carried out to the words following those in the indexing
window or that the entries in the block be permuted also in a second-order cycle. The
latter suggestion has the advantage of facilitating any two-term coordinate indexing
type of search, "because one can now look up directly any pair of subject words,
regardless of whether or not they occur adjacently in a sentence." 2/
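
Doyle's two suggestions can be sketched as follows. This is an illustrative reading only, not his program: the function names, the tuple layout of an entry, and the use of whitespace splitting to find words are assumptions.

```python
from itertools import permutations


def significant_words(title, stop_words):
    """Lower-cased title words that are not on the stop list."""
    return [w for w in title.lower().split() if w not in stop_words]


def first_order_entries(titles, stop_words):
    """Entries (keyword, following words, title number), sorted so that
    each keyword block is alphabetized by the words after the keyword."""
    entries = []
    for title_id, title in enumerate(titles):
        words = title.lower().split()
        for i, word in enumerate(words):
            if word not in stop_words:
                entries.append((word, tuple(words[i + 1:]), title_id))
    return sorted(entries)


def second_order_entries(titles, stop_words):
    """Entries (first word, second word, title number) for every ordered
    pair of distinct significant words in a title, adjacent or not."""
    entries = set()
    for title_id, title in enumerate(titles):
        keys = dict.fromkeys(significant_words(title, stop_words))
        for first, second in permutations(keys, 2):
            entries.add((first, second, title_id))
    return sorted(entries)


titles = ["automatic indexing of legal titles", "indexing by machine"]
stop = {"of", "by"}
for entry in second_order_entries(titles, stop):
    print(entry)
```

The second-order listing grows roughly with the square of the number of significant words per title, which is the price paid for direct two-term lookup.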

Redundancy in KWIC indexes, which aggravates the sequential scanning and the
long-block fatigue effects, is in large part the result of difficulties in establishing the most
appropriate bounds for exclusion or "stop" lists. We have previously distinguished
machine-generated indexes of the derivative type from certain of the machine-compiled
indexes primarily on the basis that in the first case, the criteria for determining the
significance of the keywords to be used as the index access points are applied
automatically during the machine processing, even if the selectivity so achieved is only
"negative selectivity." 3/ The amount of index entry redundancy, of too many entries
and of irrelevant entries is, in simple KWIC indexing, a direct function of the length and
contents of the stop list.
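
A minimal sketch of this "negative selectivity" follows. The stop list shown is a small illustrative sample, not any list in actual use, and the name `keywords` and the punctuation-stripping rule are assumptions.

```python
# Illustrative stop list only; real lists range from a dozen words to thousands.
STOP_LIST = frozenset({
    "a", "an", "the", "and", "or", "of", "in", "on", "for", "to", "with",
    "is", "are", "report", "analysis", "theory",
})


def keywords(title, stop_list=STOP_LIST):
    """Return the title words that survive the stop list.

    Every surviving word becomes one index entry, so the number of
    lines printed for a title depends directly on the stop list.
    """
    words = [w.strip(".,;:()\"'").lower() for w in title.split()]
    return [w for w in words if w and w not in stop_list]


title = "A Theory of Automatic Indexing"
print(keywords(title))                    # entries with the sample stop list
print(len(keywords(title, frozenset())))  # entries with no stop list at all
```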

In Luhn's original proposals for both KWIC and other types of automatic indexing,
he pointed out the importance of the rules which must be established in order to
differentiate the significant words from the nonsignificant. He says, for example:

"Since significance is difficult to predict, it is more practicable to isolate it
by rejecting all obviously nonsignificant or 'common' words, with the risk of
admitting certain words of questionable value. Such words may subsequently be
eliminated or tolerated as 'noise'. A list of nonsignificant words would include
articles, conjunctions, prepositions, auxiliary verbs, certain adjectives, and words
such as 'report', 'analysis', 'theory', and the like." 4/

1/ W. W. Youden, 1963 [658], p. 331.

2/ Doyle, 1961 [166], p. 13.

3/ Artandi, 1963 [20], p. 15.

4/ Luhn, 1959 [381], p. 289.



Interesting variations are to be noted in the current practices of using stop lists.
Some lists are quite short, and others extend to several thousand words. Parkins reports
that a mere 14 words on the stop lists used for B.A.S.I.C. are responsible for 80 percent
of the title lines that need not be printed, but that their original list of 200 stop words grew
quite rapidly to more than 1,000 now in use. 1/ Chemical Abstracts Service representatives
reported in 1962 an initial list of about 1,000 words which dropped to 300 at one time and
then was increased again to the original level. Using a stop list of 82 words eliminated
30 percent of a 42,000-word corpus of internal reports at the System Development
Corporation (Olney, 1961 [456]).

Critical questions in the establishment of stop lists relate to the problem of balancing 
the economics of the number of title lines to be printed and to be subsequently scanned 
against the loss of retrieval effectiveness if certain words are omitted from the search 
entry positions. How this balance should be achieved may vary from one subject field to 
another and between different organizations. In several regularly published KWIC indexes, 
the actual list used to exclude the presumably nonsignificant words is printed so that the 
user can check before proceeding to actual search. Williams has suggested that each
excluded word be listed once, in its proper alphabetic place in the index, if it occurs in
the titles of the particular set of items being indexed. 3/

In general, however, not enough is yet known about the requirements of particular 
subject fields and particular types of organization to arrive at the most effective
compromises in establishing exclusion lists for keyword indexing. Noting that stop lists in
actual use vary from only a few function words such as prepositions and conjunctions to 
lists several hundred words long, Brandenberg points out that: 

"At the present state of the KWIC indexing art the selection of stop words appears 
to be largely arbitrary and a comparison of half a dozen stop lists shows that they 
have about two dozen words in common. " 

Kennedy and Doyle both specifically suggest that more research on the contents and
effects of stop lists is necessary (Kennedy, 1961 [311], 1962 [310]; Doyle, 1963 [162]),
but Kennedy points out the ease with which the machine programs themselves can be used
for modification of the lists. 5/

seen. The collection of large amounts of text and their analysis will undoubtedly be
the best way of determining the effects of these variables."

Some of the reasons for keeping stop lists short, however, may reflect unnecessary
programming difficulties. Turner and Kennedy have reported that in the SAPIR system a 
title word is compared only with the group of nonsignificant words that have the same 
number of characters, in order to reduce the machine time required for the exclusion 
list search. 2/ Skaggs and Spangler give an account of an exclusion list system developed 
for general text processing as follows: 

"A representative form developed by General Electric is composed of three groups 
of words, high frequency, special and standard. The high frequency words (25) 
occur most frequently in English text. A compression of approximately 35 percent 
will occur for most kinds of text when these 25 words are deleted. The special 
words are derived from the particular body of text being processed. The com- 
position of this group is left to the program user. Normally the words for this 
group are selected by making an Editing list in alphabetical sequence. The words 
appearing in the index position on the preliminary listing are then reviewed. 

"Standard words are words that occur with a relatively high frequency in most 
types of text and therefore are appropriate for a general purpose screen. In the 
GE program, 375 words are used in this group. 

"To minimize computer processing time, it is desirable that words in the Ex- 
clusion Dictionary be arranged in approximate order of their frequency of 
occurrence. " 2/ 

It should be noted, however, that in most cases stop list searches can be programmed in
the form of so-called "logarithmic", "partitioning" or "bifurcation" searches in which
the number of machine operations required is only log_2 N + 1, where N is the number of
words in the list.
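
The two lookup strategies mentioned above, comparing a title word only with stop words of the same length and searching a sorted list by repeated halving, can be sketched as follows. Neither function reproduces the SAPIR or General Electric programs; the names `bucket_by_length`, `binary_lookup` and `max_probes` are assumptions, and a "probe" here means one three-way comparison against a list entry.

```python
def bucket_by_length(stop_list):
    """Group stop words by number of characters, so that a title word
    need only be compared with the group of the same length."""
    buckets = {}
    for word in stop_list:
        buckets.setdefault(len(word), []).append(word)
    return {n: sorted(words) for n, words in buckets.items()}


def binary_lookup(word, sorted_words):
    """Search a sorted list by halving; return (found, probes)."""
    lo, hi = 0, len(sorted_words) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_words[mid] == word:
            return True, probes
        if sorted_words[mid] < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, probes


def max_probes(n):
    """Worst-case probes for a list of n words: floor(log2 n) + 1,
    which for n > 0 equals the bit length of n."""
    return n.bit_length() if n > 0 else 0


stop_words = sorted(f"w{i:04d}" for i in range(1000))  # hypothetical 1,000-word list
print(binary_lookup("w0617", stop_words), max_probes(len(stop_words)))
```

For a list of about 1,000 words the bound is ten probes, so the length of the stop list is not in itself a serious cost in machine time under this scheme.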

The more words excluded, the fewer the title entry lines that must be included in 
the final index. This is a factor involving first of all the user in the sequential scanning 
he must do, where, as Coates has remarked, the retrieval effectiveness is usually in 
inverse proportion to the amount of such scanning required. Secondly, longer stop lists 
help to minimize the long block problem, since it is obviously the most frequently 
occurring title words that have not been excluded that cause the longest blocks of entries.
The important economic factor, however, is the total number of lines to be printed in the
index, which is directly reflected in page costs. The effects of page costs, in turn,
engender compromises in printing quality, such as page format and size of type. These
are among the serious unresolved problems that affect user acceptance of KWIC indexes
and involve questions of format, legibility, character sets, and size of the index.

In general, however, in the present state of the art of KWIC indexing, the consensus
seems to be that of qualified praise, especially for the early announcement and
dissemination applications. The KWIC index is recognized as responding to a definite need, 1/
as having merit for fields in which more conventional indexes do not exist as well as for
current awareness searching, 2/ as receiving excellent response from users "because
they can take a handy booklet, sit down at a table and look under the words they know and
use, and which they expect other engineers to use in titles." 3/ Bernier and Crane, after
considering comparative effectiveness data for subject as against word indexing, come to
the following conclusions:



"Title lists keyed by words have value for quick distribution and fast use since time
is often a very important element in the obtaining of information. Such lists do not
serve adequately for thorough searching. ... A title concordance may be more useful
than would seem from the ... data on index entries. However, it must obviously
be incomplete, must have many unnecessary entries, and would not prove suggestive
enough to users who lack background in the subjects sought."



Additional benefits can quite readily be obtained by taking advantage of the
bibliographic information once it is in machine-readable form to provide selective KWIC
indexes (Balz and Stanwood, 1963 [28]; Black, 1962 [65]; Carroll and Summit, 1962 [102]),
machine retrieval of item citations by specified keywords (Kennedy, 1961 [311]), and
selections of items geared to a Selective Dissemination of Information System (Barnes and
Resnick, 1963 [36]; Balz and Stanwood, 1963 [28]). Gallianza and Kennedy at the
Lawrence Radiation Laboratory, for example, report as being under development
programs for the IBM 1401 and 7090 computers which will combine KWIC type indexing
features with the logical search operators "AND", "OR", and "IF" in order that users
may specify subject searches in ordinary English language terms. 5/

3.2 Modified Derivative Indexing

Some of the more obvious of the disadvantages of KWIC indexing techniques can be 
reduced if not eliminated by a variety of human and machine procedures. These include 
augmentation of titles to provide additional clues to subject aspects, manual post-editing, 
and synonym reduction through such devices as thesaurus lookups. 

The ink was scarcely dry on the first issues of a KWIC index before a number of
suggestions for improvements, modifications, and augmentations were proffered in the
literature. In fact, both Luhn and Baxendale considered various possible refinements in
their original proposals. The first systematic review of work in the field of automatic
extracting--whether to produce indexes or abstracts, or both--was made by Edmundson
and Wyllys in 1961 [181]. They covered not only the KWIC type indexes as such, but also
modifications suggested by Baxendale, Luhn, Oswald and others, and they themselves
advanced a number of additional possibilities. Of the various modifications and
refinements that have been suggested, the most obvious is that of title augmentation.

3.2.1 Title Augmentation

The machine-prepared index that was probably the first to go into productive
operation is actually one involving title and subject indicators rather than pure
keyword-from-title permutations. The CIA project, beginning in 1952, is based upon manual
pre-editing of the titles themselves, with the words to be picked up as index entries being
underlined. In addition, it involves assignment of other words, descriptors or terms
from a hierarchical classification schedule to indicate additional access points (Veilleux,
1961 [624]).

In later KWIC type indexing, the possibilities of improving effectiveness by
pre-editing or post-editing to modify and expand titles have been suggested and explored by a
number of investigators. The semi-automatic indexing reported by Janaske adds
descriptive words or phrases in parentheses at the end of titles and uses them as
additional indexing points (Janaske, 1962 [299]). At Biological Abstracts Service,
improvements have been obtained (without sacrifice in the speed desired in order to index
5,000 abstracts twice a month) by title supplementation as well as by an improved stop
list and by post-editing word divisions and word recombinations. 1/ Titles for each of
two 12,000-item bibliographies in the field of radiobiology are reported as being edited
considerably before KWIC type processing. Other examples of modified derivative
indexing based on title augmentation include Chemical Patents, 3/ the Applied Physics
Letters indexing project at Oak Ridge National Laboratory, which provides for an
author-prepared form to describe features of property and method not covered in the title, 4/
and the KWIC Index to Neurochemistry ([420]).

To some extent, however, the use of human editors to improve the product of KWIC
type indexing defeats the initial purpose of a quick and purely clerical or mechanical
process. Thus, Dowell and Marshall argue:

"... The basic permuted-title index can be substantially improved by editing and
rewriting the titles before they are submitted to the computer. ... But this of course,
destroys the great advantage claimed for the permuted title index, 'that it is a
purely clerical process'. Intellectual effort has entered the picture again and we
are back where we started." 1/

In the extreme case, the re-introduction of intellectual effort is in effect the
re-introduction of conventional human indexing, with the machine's role limited to that of
compilation, as in the case of the "notation-of-content" statements prepared for NASA's STAR
System (Slamecka and Zunde, 1963 [561]; Newbaker and Savage, 1963 [430]).

Kennedy suggests instead, therefore, that the augmentation might be accomplished by
the authors themselves. However, it may then be pointed out, as by Bernier and Crane,
for example, that the supplementation of titles before publication in order to provide
suitable additional indexing words would be "awkward, space-consuming and difficult".
They continue:

"It would call for the attention of index experts at the manuscript stage, which would 
delay publication and expand the total indexing effort. Furthermore, good, thorough 
indexes are based on the full information of abstracts and papers, not on their titles 
only. " 2/ 

An alternative method for title augmentation to improve the quality of KWIC indexing
is therefore to establish procedures for machine selection of significant words from more
of the text than just the titles alone. In fact, Luhn himself did not limit his technique as
originally proposed to titles only but indicated that the process could be performed at
various levels: title, abstract, or full text. 3/ In the 1958 permuted index to the ICSI
preprints, entries were derived from titles, authors' names, author affiliations, headings
within the paper, figure and table captions, and sentences and phrases taken directly
from text. 4/ Combinations of human and machine procedures based on sentences and
phrases selected from text are described by Herner, who cites a two-fold advantage:

"First, it is not wholly dependent on the informativeness or lack of informativeness of
titles and bibliographic citations, and, second, it affords a greater depth of analysis than
is generally possible where titles or bibliographic descriptions alone are used." 5/

Taking more text as the basis for automatic derivative indexing adds, of course, the
problems and costs of keystroking additional input material. At the same time, most of
the major problems of scatter of references, synonymity, redundancy and exclusive
reliance on the author's own language and terminology not only remain but may quite
probably be intensified. The problems of establishing suitable rules for selection of
significant words are aggravated, not only by the far larger number of different words to
be processed, but because of unresolved problems in effectively relating length of index
and depth of indexing to the length of the document. 1/

There are, however, a number of practical suggestions by which machine
augmentation of titles might be accomplished. First is the invariant selection of words that are
capitalized, other than those that begin a sentence. 2/ As Wyllys points out, this type of
selection criterion would emphasize proper names, and these in turn might be particularly
valuable clues, especially in a military intelligence situation. 3/ It has also been
suggested that the selection criteria should depend on particular pre-specified contexts,
such as being preceded by the words: "the results were ...", "in conclusion ...", and
the like.
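
Both selection criteria can be sketched in a few lines. The sketch is illustrative only: the sentence splitter is a crude assumption that breaks on ".", "!" or "?" followed by white space (and so mishandles abbreviations), and the names `capitalized_terms`, `cue_sentences` and `CUE_PHRASES` are hypothetical.

```python
import re

# Illustrative cue phrases; an actual list would be fixed in advance by the system designer.
CUE_PHRASES = ("the results were", "in conclusion")


def split_sentences(text):
    """Very rough sentence splitter (assumption: ., ! or ? ends a sentence)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def capitalized_terms(text):
    """Select capitalized words other than those that begin a sentence."""
    terms = []
    for sentence in split_sentences(text):
        for word in sentence.split()[1:]:       # skip the sentence-initial word
            token = word.strip(".,;:()\"'")
            if token[:1].isupper():
                terms.append(token)
    return terms


def cue_sentences(text, cues=CUE_PHRASES):
    """Select sentences that contain one of the pre-specified contexts."""
    return [s for s in split_sentences(text)
            if any(cue in s.lower() for cue in cues)]


sample = "The index was compiled at Oak Ridge. In conclusion, Luhn proposed KWIC."
print(capitalized_terms(sample))
print(cue_sentences(sample))
```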

A second type of machine selection procedure is the converse of the exclusion or 
stop list, namely, an inclusion list or dictionary which may involve especially significant 
words for a particular subject matter area or words that are of importance to a particular 
organization. In the discussions of the Area 5 ICSI papers it was remarked: 

"Another complication is that mechanized indexing finds in a paper what was 
important to the author. What happens if there is something in the paper not 
important to the author but of importance to the indexer? One possibility is 
to have a list of words and phrases expressing the interests of a particular 
collection, which the machine looks for in the papers. If this word or phrase 
occurs even once, it should be picked up as an indexing term." 4/

This approach to the selection problem can be combined with other devices, as in the
"Selective Dissemination" system described by Kraft in which keyword extraction indexing
is applied to abstract, title, author's name and manually assigned index terms, after
processing of all input material against both "in" and "out" dictionary lists. 1/

The use of abstracts rather than full text as source material makes the selection 
criteria problems somewhat less severe. In addition, there is evidence to suggest that 
the abstract does contain much of the significant information that would normally be 
indexed and the text of the abstract is therefore a fertile field for title augmentation. In 
experiments conducted by Slamecka and Zunde on the comparison of indexing terms 
manually assigned with the occurrences of the names of these terms in abstracts used in 
NASA's STAR system, it was found that 80.4 percent of the assigned terms were contained 
in the abstracts. Swanson, on the other hand, suggests that, at least for short articles 
having homogeneous subject matter, title and first paragraph "are nearly as good as full 
text." 

A combination inclusion-exclusion list system may involve prior "weighting for 
relevance" of words that are judged by human analysts to be significant for purposes of 
search and retrieval, as suggested by Swanson, for example: 

"The computer first separates those words which are important for purposes of 
information retrieval from those which are unimportant. This is accomplished by 
means of looking up each word in an alphabetized word list with which the computer 
is furnished. Each word in this word list carries a 'weight' which reflects an 
estimate of its importance for retrieval purposes. Words of zero weight are 
completely unimportant and discarded by the computer for indexing entries." 

Continuing work at Thompson Ramo-Wooldridge on automatic indexing methods includes 
further investigation of assignments of relevance weight estimates to words and phrases 
(1959 [490] and [491], 1963 [602]). 
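
The weighted word list lookup described by Swanson can be sketched as follows. This is a minimal illustration only: the word list, the weights, and the function name are hypothetical, and treating words absent from the list as zero-weight is an assumption not stated in the quoted passage.

```python
import re

# Hypothetical weighted word list; the words and weights are illustrative only.
WORD_WEIGHTS = {"retrieval": 3, "indexing": 3, "computer": 1, "the": 0, "of": 0}


def weighted_index_terms(text, weights=WORD_WEIGHTS):
    """Return {word: accumulated weight} for words of nonzero weight."""
    scores = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        # Assumption: a word not on the list is treated like a zero-weight word.
        weight = weights.get(word, 0)
        if weight > 0:
            scores[word] = scores.get(word, 0) + weight
    return scores
```

Words of zero weight never reach the result, which corresponds to their being "discarded by the computer for indexing entries."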

3.2.2 Book Indexing By Computer 



For internal indexing, that is, the subject indexing of the contents of a single book or 
report, automatic indexing experiments are usually directed toward the processing of 
full text, with use of stop lists of various lengths. The work of Artandi for her doctorate 
at Rutgers in indexing of a book by computer programs (1963 [20] and [22]) is an example 
of such modified derivative indexing. Specifically, Artandi's method involves: 

(1) Establishment of a list of key terms appropriate to a given subject 
area to be used as an inclusion list for word extractions from text. 

(2) Application of an appropriate syndetic apparatus to be used in the 
compilation and ordering of the index entries. 

(3) Means for the automatic selection of index entries other than those 
on the pre-specified inclusion list, especially for the selection of 
proper names. 



The text used by Artandi for her study consisted of a 59-page chapter on halogens 
from J. W. Mellor's Modern Inorganic Chemistry. This text was keypunched with 
special tags being assigned to indicate the page numbers and the incidence of capitalized 
words in the text. Text words greater than three characters in length were first checked 
against the inclusion dictionary of "detection terms". There was, in addition, an 
"expression term" dictionary which constituted the vocabulary of the final index and in 
which a given expression term might or might not be identical with the corresponding 
detection term. Cross-references were supplied by a program routine which checks the 
index term list against a list of expression terms with their detection terms grouped 
under them and which compiles cross-reference entries, one for each detection term 
associated with an expression term appearing on the index list. 
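
The interplay of detection terms, expression terms, and cross-references may be easier to follow in a small sketch. The dictionary contents and function name below are hypothetical, and the choice to emit a cross-reference only when the detection term differs from its expression term is an assumption, not a documented detail of Artandi's program.

```python
import re
from collections import defaultdict

# Hypothetical dictionary: detection term -> expression term (index heading).
DETECTION_TO_EXPRESSION = {
    "chlorine": "chlorine",
    "bromine": "bromine",
    "muriatic": "hydrochloric acid",
}


def build_index(pages, detection=DETECTION_TO_EXPRESSION):
    """pages: {page_number: text}. Returns (entries, cross_references)."""
    entries = defaultdict(set)  # expression term -> set of page numbers
    for page, text in pages.items():
        for word in re.findall(r"[A-Za-z]+", text):
            if len(word) > 3:  # only words longer than three characters are checked
                heading = detection.get(word.lower())
                if heading:
                    entries[heading].add(page)
    # Assumption: one cross-reference per detection term whose heading appears
    # in the index and whose wording differs from that heading.
    cross_refs = sorted(
        (term, heading)
        for term, heading in detection.items()
        if heading in entries and term != heading
    )
    return {h: sorted(p) for h, p in entries.items()}, cross_refs
```

Called on two short pages mentioning "chlorine" and "muriatic acid", the sketch yields page references under the headings "chlorine" and "hydrochloric acid", plus a "muriatic, see hydrochloric acid" style cross-reference.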

For her experimental corpus, Artandi's program developed 363 page references, 
138 different index entries and 35 cross-references. She compared these results with 
those obtainable by conventional human indexing with respect to the factors of heading 
density (ratio of number of entries to number of words in the book), entry density (ratio 
of the number of page references to the number of pages), and distribution (ratios of 
entries for chemical compounds, proper names, and subject entries to the total number 
of entries). No indexing errors were found in the computer-generated index for a 5 
percent random sample of the pages of the corpus, but five omissions were found in the 
machine indexing of these sample pages. Artandi concluded, however, that although the 
quality of indexing appeared favorable, the costs, which approximated $1.50 per page 
indexed, were impractically high. 



Book indexing by computer has also been investigated by Maloney, Dukes, and Green 
at the Army Biological Laboratories, Fort Detrick, Maryland. 1/ Input is based on the 
by-product paper tape generated when the manuscript is typed on a tape typewriter. The 
paper tape is in turn converted to punched cards which are then processed by a UNIVAC 
SS-90 II computer in an editing run that deletes unrecognizable codes and then stores page, 
line, sentence number and other reference identifications. After reprocessing against 
a stop list of common words, all other words in the edited text are selected as 
candidate index entries; these are then sorted into alphabetical order with subsequent 
printout giving each word occurrence followed by the entire sentence which contained it 
and the page and other location identifications. This computer output is then post-edited 
manually not only to eliminate trivial entries but also to normalize terms and phrases 
used. 

1/ C. J. Maloney, private communication. A report by C. J. Maloney, J. Dukes, and 
S. Green, "Indexing reports by computer" is in process of preparation for 
publication. 
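
The stop-list pass and alphabetical sentence listing described for the Fort Detrick procedure might be sketched as below. The stop list, the sentence-splitting rule, and the function name are assumptions made for illustration; the original ran from punched cards, not from strings.

```python
import re

STOP_LIST = {"the", "a", "of", "and", "is", "in", "to"}  # illustrative stop list


def candidate_entries(pages, stop_list=STOP_LIST):
    """pages: {page_number: text}.

    Returns alphabetized (word, page, sentence) rows, one per word occurrence
    that survives the stop list, ready for manual post-editing.
    """
    rows = []
    for page, text in pages.items():
        # Assumption: a sentence ends at ., ! or ? followed by white space.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            for word in re.findall(r"[a-z]+", sentence.lower()):
                if word not in stop_list:
                    rows.append((word, page, sentence))
    return sorted(rows)
```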

3.2.3 Modified Derivative Indexing - Baxendale's Experiments 

As has been previously noted in the introduction to this report, the name of Phyllis 
Baxendale together with that of H. P. Luhn is generally accorded credit for pioneering 
efforts in the entire area of automatic indexing. Baxendale in particular is generally 
credited with the first actual experiments in modified derivative indexing. In investigations 
beginning in the late 1950's, she has explored not only statistical approaches to 
automatic selection of index terms (based for example on word frequencies) but also the 
use of word pairs, word groups, contextual associations, and in particular the 
subject-indicating clues of prepositional phrases (Baxendale, 1958 [41], 1961 [40], 1962 [42]; 
Becker, 1960 [44]; Edmundson and Wyllys, 1961 [181]). 

Baxendale began by considering the patterns of scanning that humans typically use 
to select "topic" sentences, phrases and words, and she then proceeded to simulate by 
computer program the selection of phrases consisting primarily of nouns and modifiers. 

In her first experiments (1958 [41]) she used two methods of automatic selection. In 
the first procedure, words serving the grammatical functions of pronoun, article, 
auxiliary verb, conjunction and the like were deleted by stop list lookup. Frequency 
count statistics were then derived for the remaining words. In her second procedure, 
the computer was programmed to select prepositional phrases from text and to use the 
four words succeeding the preposition as index entries unless an additional preposition or 
a punctuation mark was first encountered. 
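
The second procedure lends itself to a short sketch. The preposition list, the tokenizing rule, and the function name are assumptions made for illustration; only the "four words unless a preposition or punctuation mark intervenes" rule comes from the description above.

```python
import re

PREPOSITIONS = {"of", "in", "on", "for", "by", "with", "to", "from"}  # illustrative


def phrase_terms(text, max_words=4):
    """Collect up to max_words words after each preposition, stopping early
    at another preposition or at a punctuation mark."""
    # Words become one token each; any other non-space character is its own token.
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    phrases = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in PREPOSITIONS:
            words = []
            j = i + 1
            while j < len(tokens) and len(words) < max_words:
                tok = tokens[j]
                if not tok.isalpha() or tok.lower() in PREPOSITIONS:
                    break
                words.append(tok.lower())
                j += 1
            if words:
                phrases.append(" ".join(words))
            i = j  # resume at the token that ended the phrase
        else:
            i += 1
    return phrases
```

For the title "Selection of index terms in scientific text, by computer." the sketch picks out "index terms", "scientific text", and "computer".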

In later experiments, Baxendale has explored possible grammatical models "which 
would select all and only nouns or adjective-noun combinations". 1/ Taking as an initial 
corpus a sample of document titles, rules were devised to reject for human analysis titles 
with question-marks and the like, to eliminate numeric information and single symbols, 
and to segment the title into its component clauses and phrases by the detection of 
commas, periods, and similar clues. By list lookup, certain words are identified as 
capable of serving the syntactic functions of being quantifiers, prepositions, or clause 
introducers. Special subscripts are then assigned to these words and the subscripts are 
examined by machine to provide further segmentation; to delete quantifiers, auxiliary 
verbs, or words ending in "ed" or "ing" and preceded by an auxiliary verb; and to 
determine relationship functions between the remaining, presumably substantive, words. 

Still other work by Baxendale has been directed toward the development of frequency 
of co-occurrence or textual association of candidate indexing terms. She reports as 
follows: 

"[In the frequency matrix] ... the diagonal elements ... give the total frequency of an 
index term and the off-diagonal gives the frequency of co-occurrence of two terms. 
The diagonal of the 'context' matrix represents that portion of the total vocabulary 
with which an individual term has been coordinated, and the off-diagonal the extent 
to which two terms have common context ... Such matrices give a basis for examining 
the extent to which terms are generic or specific within the context of the collection 
of documents. One can speculate that terms occurring with high frequency and wide 
context, i.e., with frequencies distributed amongst all or nearly all off-diagonal 
elements of the matrix are of such broad connotation as to be indifferent 
discriminators of content ... The frequency and context matrices can again be used to 
determine the modifiers with which they can most meaningfully be coupled for the 
collection of documents being considered." 
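
A frequency matrix of the kind Baxendale describes can be sketched in a few lines. This sketch assumes that each document is represented simply as a set of index terms and that "frequency" is counted per document; both are simplifying assumptions, and the function name is hypothetical.

```python
from itertools import combinations


def frequency_matrix(documents):
    """documents: list of sets of index terms, one set per document.

    Returns (terms, matrix). matrix[i][i] counts the documents containing
    term i; matrix[i][j] counts the documents containing both i and j.
    """
    terms = sorted(set().union(*documents))
    pos = {term: k for k, term in enumerate(terms)}
    n = len(terms)
    matrix = [[0] * n for _ in range(n)]
    for doc in documents:
        for term in doc:
            matrix[pos[term]][pos[term]] += 1  # diagonal: term frequency
        for a, b in combinations(sorted(doc), 2):
            matrix[pos[a]][pos[b]] += 1  # off-diagonal: co-occurrence
            matrix[pos[b]][pos[a]] += 1
    return terms, matrix
```

A term whose row has nonzero counts spread across nearly every column would be, in Baxendale's terms, a candidate "indifferent discriminator" of content.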

Finally, Baxendale notes that on the basis of her studies it should be possible to 
select quasi-subject headings based on frequency counting criteria, but then to order the 
remaining vocabulary of selected terms according to contextual measures of association 
which are semantic, syntactic, or statistical in nature. Experimental results for a 
collection of 1,500 documents included semantic associations between "searching" and 
"retrieval", syntactic associations of "machine" or "literature" with "retrieval", and 
the apparently misleading association of "metal" with "retrieval", which, however, had 
statistical significance within the particular document sample. 

Other investigators who have explored noun-adjective clues for selection include 
Anger, Chonez, Langleben and Shumilina, and Swanson. Anger looked for relationships 
indicated by syntactic dependencies or by noun-adjective and adjective-adverb linkages, 
and gave in an appendix a suggested program for phrase inversions. Chonez has 
described a computer program which by recognizing "separating" words, especially 
prepositions, and applying "pseudo-grammatical" rules compiles an index to English 
language items in the fields of ionized gas physics and thermonuclear fusion. It is 
claimed that: 

"The subject index thus prepared is similar in presentation to Luhn's KWIC indexes, 
but is fundamentally different in conception and is in fact intermediate between ... 
(this) ... and the conventional alphabetic subject indexes." 

Langleben and Shumilina are concerned with machine-aided procedures for translation 
from natural language materials to an intermediary or documentation language. 
They indicate, for example, that the preposition "from" serves as a key for the treatment 
of two nouns connected by it. Swanson, describing research project progress at 
Ramo-Wooldridge as of 1960, reported to the National Symposium on Machine Translation with 
respect to multiple meaning problems as follows: 

"We are also investigating the possibility of discovering semantic attributes of 
words based upon certain automatically recognizable statistical features of the 
context. Our initial endeavor in this direction has been to attempt to discover 
a classification system for nouns based upon their frequency spectrum of categories 
of modifying adjectives, these categories being automatically recognizable." 

3.3 Derivative Indexing From Automatic Abstracting Techniques 



While Baxendale's work has had certain points in common with automatic abstracting 
or extracting processes, particularly in the use of word frequency statistics and the 
consideration of possibilities for first selecting topic sentences, her major interests in 
this area have been in automatic indexing as such, rather than in machine selection of 
sentences from text to serve as an automatic extract or derivative abstract of the 
document. Much of the machine processing to date of full text for documentation 
purposes, however, has had the latter goal as the principal research objective. 

As we have previously noted, the subject of automatic abstracting or auto-condensation 
is not in itself a primary concern of this survey. Nevertheless, the significant 
words occurring in the abstract of a document, whether generated by man or by 
machine, are obviously good candidates for indexing terms. Moreover, it has been 
strongly suggested that the questions of using positional, editorial, and syntactical clues 
in order to improve automatic indexing techniques will profit by research that is being 
done in both automatic extracting procedures and in other types of linguistic data 
processing based upon full text. 3/ 

3.3.1 Auto-Condensation and Auto-Encoding Techniques of H. P. Luhn 

Although Luhn's work in the field of documentation aided by machine has had its best 
known and most popular acceptance with respect to the KWIC index proper, even more 
provocative possibilities lie in the development of some of the auto-condensation and 
auto-encoding techniques which he also proposed, especially for full text processing. In this 
area, although he himself has also suggested a variety of possible improvements and 
refinements, the actual experimental work done by him and by his associates has mostly 
been done on the basis of word frequency statistics. 

Considering first the most frequently occurring words in a given text as too common 
to be subject-indicative (those usually stopped or purged by a suitable exclusion dictionary 
or stop list, for example) and next the least frequent words as being rarely topical in a 
content-revealing sense, Luhn settles upon a middle range of frequency of word occurrence 
as the basis for his auto-condensation processes. The actual frequency counts are 
computed, together with indications of page, line, and occurrence within the same 
sentence. When this has been done for the complete text, each individual sentence is then 
checked for the "score" of relatively high frequency words occurring in it, and sentences 
with the highest scores are then automatically selected, in textually-occurring order, and 
are printed out as an abstract, more properly an extract, of the document. 
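
The procedure just outlined can be sketched as follows. This is a deliberately simplified reading: the stop list stands in for the upper frequency cutoff, a minimum count stands in for the lower cutoff, and a sentence's score is taken to be a plain count of its significant words. The stop list, thresholds, and function name are all assumptions for illustration and are not Luhn's published parameters.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "by"}  # illustrative


def luhn_extract(text, min_count=2, n_sentences=2, stop_words=STOP_WORDS):
    """Return the n_sentences highest-scoring sentences in textual order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words if w not in stop_words)
    # Middle range: not on the stop list (too common), not below min_count (too rare).
    significant = {w for w, c in counts.items() if c >= min_count}
    scores = [sum(1 for w in words if w in significant) for words in tokenized]
    # Highest scores first; ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda k: (-scores[k], k))
    return [sentences[k] for k in sorted(ranked[:n_sentences])]
```

Taking the highest-ranking significant words themselves, rather than the sentences, would give the corresponding auto-encoding pattern.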

The automatic encoding of documents may be achieved either by taking the high 
ranking words of the selected sentences or by selecting the highest ranking of the words 
in the entire document as index entries. Luhn typically justifies these procedures as 
follows: 

"Of various automatic procedures for deriving typical patterns for characterizing 
documents, the systems here proposed are based on operations involving 
statistical properties of words ... It is held that the more often a certain word 
appears in a document the more it becomes representative of the subject matter 
treated by the author. In grading words in accordance with the frequency of usage 
within a document, a pattern is derived which is typical of that document and unique 
amongst all similarly derived patterns of a collection of documents. It is proposed 
that the more similar two such patterns are the more similar is the intellectual 
contents of the documents they represent ... 

"... The creation of an encoding pattern may consist of listing an appropriate 
portion of the words ranking highest on the word frequency list derived from a 
document. Experiments conducted so far on documents ranging in size from 500 
to 5000 words have indicated that word patterns consisting of from ten to twenty- 
four of the highest ranking words furnish adequate discrimination and resolution 
for retrieval, sixteen such words being a likely average." 1/ 

At Wright-Patterson Air Force Base an automated information selection and 
retrieval system has been developed jointly by Air Force and IBM personnel 
(Gallagher and Toomey, 1963 [205]). It involves both auto-indexing and auto-abstracting 
techniques following the Luhn word-frequency-counting techniques. Pre-editing 
is applied to demarcate fields (e.g., title, author) and to flag certain text words, 
particularly proper names, for special treatment. Special treatment, over and above the 
frequency-based selection score, is also given to words in the title field. 

On the abstracting side, modifications to the original Luhn formula involve 
segmenting sentences in terms of strings of both high and low valued words separated 
by either periods or continuous strings of low valued words, or the assumption that 
long consecutive strings of low value words should weight negatively. The automatic 
extract consists of the highest ranking 20 percent of the sentences subject to the 
restriction that no less than 7 and no more than 20 sentences should be selected. On the 
indexing side, the investigators report: 

"As it is currently run, the auto-indexing program selects about one word in ten 
as a keyword in articles of three thousand words or less. In articles longer than 
three thousand words it tends to pick about one word in fifteen. This high incidence 
of keywords naturally increases the amount of noise results returned by the query 
program, although good search strategy cuts them down considerably." 1/ 
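
The extract-length rule reported for the Wright-Patterson system (the highest ranking 20 percent of the sentences, but no fewer than 7 and no more than 20) reduces to a small calculation. The function name is hypothetical, and the handling of documents with fewer than 7 sentences is an assumption, since the source does not say what was done in that case.

```python
def extract_length(total_sentences, fraction=0.20, minimum=7, maximum=20):
    """Number of sentences to place in the automatic extract."""
    target = round(total_sentences * fraction)
    bounded = max(minimum, min(maximum, target))
    # Assumption: a document shorter than the minimum is extracted in full.
    return min(bounded, total_sentences)
```

Under this rule a 50-sentence paper yields 10 extracted sentences, while any paper of 100 sentences or more is capped at 20.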

As of October 1963, the system was reported to be fully operative although not as 
yet extensively tested in actual use. Gallagher and Toomey give illustrative auto-extract 
results on two tested papers, one being Luhn's own "Automatic Creation of Literature 
Abstracts". They give comparative results for manual versus machine selection of keywords 
as index or search terms with 88.6 percent agreement, the human indexers having 
selected, in 6 tests reported, 132 words and the machine method 117. Modifications 
under consideration include pre-edit flagging of terms in author and cited-reference 
fields for special weighting, setting the length of the abstract as a function of the total 
number of words in an item, and, in the search program, generating additional search 
terms by means of association factor techniques such as those suggested by Stiles. 

To the basic approach of straightforward word frequency counting, Luhn himself 
has suggested that improvements might be obtained from considering closely adjacent 
words, 2/ word pairs, 3/ and reference to vocabularies specific to a given field. 4/ 
Other possibilities are capitalized words and lookup against an inclusion list. He also 
suggests: 

"If certain words could be given in their relationships to other words, more 
specific meanings may be identified by such combinations. These relationships 
may range from the mere co-occurrence of certain words within a phrase or 
sentence to the combinations of specific parts of speech." 5/ 

1/ Gallagher and Toomey, 1963 [205], p. 51. 
2/ Luhn, 1959 [384], p. 10. 
3/ Luhn, 1962 [373], p. 11. 
4/ Luhn, 1959 [384], pp. 8 and 10. 
5/ Ibid, p. 5. 

Various investigators have proceeded to explore these and other possible improvements, 
including incorporation of relative frequency information, use of information 
about distances between high-ranked significant words, word pairs and word n-tuples, 
and other devices to improve detection of significant clues to subject content. 
Representative examples of such work will be discussed below. In addition, investigators 
abroad have developed modifications to the basic Luhn word frequency approach which 
appear to be necessary when it is applied to languages other than English. 1/ 

Thus, for example, Purto reports various investigations conducted by V. A. Argayev 
and V. V. Borodin and by himself with respect to Russian language documents. 2/ Purto 
notes first that the Luhn method as applied to Russian language materials selects 
sentences which, while having the largest "significance coefficients", were not those most 
essential to the meaning and further that: "an abstract in Russian made by Luhn's method 
results in a choice of sentences not conveying basic information and not logically connected 
with each other." The reasons for such failure he attributes to the fact that words with 
different frequencies are considered equally important within a sentence for sentence 
selection purposes and to the lack of consideration for semantic and grammatical 
connectivity between significant words and between sentences. He then discusses several 
methods for determining connectivity, such as the rule that the sentences most closely 
connected with each other will be those in which the greatest number of the same 
significant words occur. 
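
The connectivity rule mentioned last (sentences are most closely connected when they share the greatest number of the same significant words) can be sketched directly. The function name and data layout are assumptions; Purto's other connectivity measures are not represented here.

```python
from itertools import combinations


def connectivity(sentence_words, significant):
    """sentence_words: list of word lists, one per sentence.
    significant: set of significant words.

    Returns {(i, j): number of significant words shared by sentences i and j}.
    """
    sets = [set(words) & significant for words in sentence_words]
    return {
        (i, j): len(sets[i] & sets[j])
        for i, j in combinations(range(len(sets)), 2)
    }
```

Sentence pairs with the largest counts would then be preferred when assembling an extract whose sentences hang together.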

A somewhat different example of difficulties occurring when the basic Luhn technique 
is applied to material in languages other than English is given by Levery. He describes 
a study of thirty French texts concerned with the development and manufacture of glass. 
He reports as follows: 

"While we followed the classical idea that a relationship between the frequency of 
a word and its significance exists, the fact that we worked with French texts forced 
us to discount the value of frequency alone. 

"French authors generally do not like to repeat the same words, and they vary their 
vocabulary ... It was necessary to combine the frequencies of words with the same 
meanings or related to the same idea. 

"A dictionary of synonyms was constructed ... (and) different versions of the same 
word had to be regrouped." 
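
Combining the frequencies of synonyms and variant forms, as Levery describes, amounts to counting "notions" rather than raw words. The tiny synonym dictionary below is invented for illustration and is not taken from his study.

```python
from collections import Counter

# Hypothetical synonym dictionary: each variant maps to one canonical notion.
SYNONYMS = {
    "verre": "verre",
    "verres": "verre",
    "vitre": "verre",
    "four": "four",
    "fours": "four",
}


def notion_frequencies(words, synonyms=SYNONYMS):
    """Count words after mapping each to its canonical notion (if any)."""
    return Counter(synonyms.get(word, word) for word in words)
```

An author who alternates among several words for the same idea then still produces one high count for that idea.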



1/ Note, however, that in the automatic abstracting program at Thompson Ramo-Wooldridge, 
small-scale experiments suggest that automatic abstracting is 
as feasible for other Indo-European languages as for English (1963 [603], p. ii). 
Also, at the Centre d'Etudes Nucleaire Saclay, automatic extraction experiments 
are being applied to texts both in French and other languages; see National Science 
Foundation's CR&D report No. 6, [430], p. 20. 

2/ Purto, 1962 [484]. He refers to a report "The problem of automatic abstracting 
and a means of solving it", by Argayev and Borodin, apparently available only as 
a typescript dated 1959. 

3/ Ibid, p. 3. 

4/ Ibid, pp. 3-4. 

5/ Levery, 1963 [359], p. 235. 

3.3.2 Frequencies of Word n-tuples - Oswald and Others 

The first alternative to the basic Luhn word frequency approach in automatic abstracting 
techniques to be actively explored was apparently that of Oswald and his 
associates (Oswald et al, 1959 [459]; Edmundson et al, 1959 [180]). Like Baxendale, 
Oswald was interested in word pairs and word groups, particularly compound-noun and 
adjective-noun compositions, as more revelatory of meaning than single words. Unlike 
Baxendale, however, he was interested in the word group itself as selection criterion, 
whereas she had used word group or phrase clues for the selection of (usually) single 
indexing terms. Differences between their two approaches, both representing very early 
efforts in the field, are summarized by Edmundson and Wyllys as follows: 

"Oswald's experiment in automatic abstracting differs from Luhn's and Baxendale's 
techniques in that it combines the notion of significance as a function of word 
frequency and the notion of significance as a function of word groupings, by employing 
juxtapositions of significant words as the basic unit for measuring the importance 
of a sentence ... 

"It may further be observed that Baxendale's exhibited indexes are made up of single 
words rather than word groups, in spite of the strong case she makes for using 
groups ... 

"Baxendale's work is concerned solely with the automatic construction of indexes; 
she does not extend her treatment of word significance into the area of automatic 
abstracting." 1/ 

Oswald's "multiterms", however, were intended to overcome, in the areas of both 
automatic indexing and automatic abstracting, at least some of the difficulty that concepts 
are often expressed in compound nouns, word pairs, and longer groups of words consisting 
of n-tuples of substantive words or of phrases. The result of considering both word 
frequency and word-group frequency is that in Oswald's selection-groups it is usually the 
case that only one word of the group has an individually high frequency but the 
co-occurrence feature heightens the significance of the relatively lower frequency words 
with which it appears. Thus, for automatic indexing, Oswald proposed significant word 
groups as indexing terms, and his criteria for selection of sentences to be included in 
machine-generated extracts are similarly based on the number of significant groups in 
the sentences chosen. 
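
A rough sketch of the juxtaposition idea follows: adjacent pairs of significant words are counted as groups, and a sentence is scored by the number of such groups it contains. Restricting groups to adjacent pairs, and the function names themselves, are simplifying assumptions; Oswald's multiterms could be longer than two words.

```python
import re
from collections import Counter


def multiterms(text, significant):
    """Count pairs of significant words standing next to each other in a sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for first, second in zip(words, words[1:]):
            if first in significant and second in significant:
                pairs[(first, second)] += 1
    return pairs


def sentence_score(sentence, groups):
    """Number of adjacent word pairs in the sentence that are known groups."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(1 for pair in zip(words, words[1:]) if pair in groups)
```

The pair counts serve as candidate index terms, and the sentence scores serve as the selection criterion for an extract.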

Other investigators who have stressed the importance of word pairs and longer groups 
as necessary to reflect concepts include Bar-Hillel (1959 [33]), Black (1963 [64]), Clark 
(1960 [123]), Doyle (1959 [165]), and Salton (1963 [519]). Doyle says succinctly that 
"when a phrase, or some other aggregation of words, stands for a single idea, its 
frequency in a document ought to interest us more than the frequencies of its component 
words." 2/ 

1/ Edmundson and Wyllys, 1961 [181], pp. 231-232. 

2/ Doyle, 1959 [165], p. 11. 

Salton considers it desirable to use word groups rather than individual words 
for purposes of identifying document contents and to use data on the joint occurrence of 
words in the same sentence or similar contexts as grouping criteria. Clark points out in 
particular that the use of ordered pairs and longer sequences of words to express a single 
concept may be highly characteristic of the special technical language used in a specific 
subject field, and notably those of the social sciences. 1/ 

Others who have explored word n-tuples as selection criteria for automatic extraction 
operations include such investigators as Szemere, Levery, and Yakushin. Szemere 
reports an investigation of 39 Swedish patent specifications in the field of 
switching circuits looking for significant word-pairs, with emphasis on noun-adjective 
combinations (1962 [591]). The objectives of a project headed by Levery at IBM-France 
have been reported as follows: 



"A series of experiments is planned in the fields of automatic indexing of 
technical texts and technical vocabulary analysis. 

"A statistical method will be tested to determine the degree of closeness in 
meaning of words. The method will consist of studying the pairs of words which 
appear together in the majority of texts and calculating a coefficient of correlation 
from the frequencies. Such work will result in a standard list of notions 
frequencies for a particular kind of information. 

"Starting from this list, new experiments will be made so as to obtain a list 
of keywords representing each text. The method will use statistical comparison 
between the distribution of frequencies of notions contained in a text and the 
standard distributions obtained for the entire corpus. " 2/ 

Yakushin (1963 [654]) develops a variation of the word-pair principle in which he 
looks for those pairs where the words are, or suggest, names of objects, such as 
"table-leg". He suggests, further, that so-called "basis nouns" can be established for 
a given scientific field and entered into an inclusion dictionary, which also contains codes 
for the lexical classes to which the word can belong and codes for determining whether or 
not the word can join with another as a "basis term". Machine routines are then 
suggested to develop whether or not given terms are jointly part of the same text, whether 
one textually precedes another in a given text, whether or not there is a "nomenclator" 
pair. Depending upon the frequency of occurrence of identical or semantically related 
nomenclator constructions, it is claimed that subject concepts can be detected. That is: 

"The method is founded on the finding in a text of so-called basis terms, 
established by list, and of the words which explain them. These explanatory 
words, which in different contexts refer to one basis term, are grouped and 
ordered according to definite rules into a subject concept." 3/ 

1/ Clark, 1960 [123], p. 460. 

2/ National Science Foundation's CR&D report No. 11, [430], p. 118. 

3/ Yakushin, 1963 [654], p. 16. 

3.3.3 Relative Frequency Techniques - Edmundson and Wyllys, and Others 

The first comprehensive critique of word frequency approaches to automatic extracting 
and indexing was undoubtedly that of Bar-Hillel (1959 [33], 1960 [34]), followed closely 
by Edmundson and Wyllys (1961 [181]), who themselves have experimented with various 
alternative or improved methods for obtaining measures of word significance by statistical 
analysis. These critics have been in agreement both on many points of specific criticism 
and on suggested possibilities for amelioration of observed difficulties, especially in 
terms of considering relative word frequencies within a particular subject field. In 
addition, several other investigators independently proposed a relative frequency approach 
at about the same time. 1/ 

Some typical expressions of opinion on the importance of relative frequency criteria 
are as follows: 

"Let me propose here a system of auto-indexing which, to my knowledge, has never 
been publicly proposed before in this form and which seems to me superior to any 
other system I have heard of ... Assume that ... we are given a list of the average 
relative frequencies of all English 'words' ... It would then be possible, for any 
given document, to rank-order all the 'words' occurring in this document according 
to the excess of their relative frequency within the document over their average 
relative frequency. By some mechanically implementable standard or other, an 
initial segment of this list is selected as the index-set." 2/ 

"Very general considerations from information theory suggest that a word's 
information should vary inversely with its frequency rather than directly, its 
lower probability evidencing greater selectivity or deliberation in its use. It is 
the rare, special, or technical word that will indicate most strongly the subject 
of an author's discussion. Here, however, it is clear that by 'rare' we must 
mean rare in general usage, not rare within the document itself. In fact it would 
seem natural to regard the contrast between the word's relative frequency f 
within the document and its relative frequency r in general use ... as a more 
revealing indication of the word's value in indicating the subject matter of a 
document." 3/ 

1/ Compare, for example, Kochen, 1963 [327], p. 7: "The idea of contrasting words 
which occur frequently in a document against the frequency of this word in the 
background language for purposes of selecting index terms seem to have been 
suggested first by Bohnert and the author, then described in more detail by 
Edmundson and Wyllys, and tested empirically by Damerau. Something similar 
was suggested even earlier by Bar-Hillel." See Bar-Hillel, 1962 [35], p. 418, 
footnote, with respect to himself, Edmundson, and Bohnert. See also, however, 
Doyle 1962 [163], p. 388: "Edmundson and Wyllys were probably the first to 
publicly advocate contrasting word frequencies within a document to word 
frequencies within a given field and using these relative frequencies as criteria for 
scoring and selecting sentences." 

2/ Bar-Hillel, 1959 [33], pp 4-8-9. 

3/ Edmundson and Wyllys, 1961 [181], p. 227. 

"We naturally find that the words of greatest interest are those for which there 
exists the greatest contrast between general usage frequency and local (within the 
article) usage frequency." 1/ 

"Luhn has bypassed syntactical analysis by taking advantage of the information 
content of the most frequently used topical words in articles . . . Edmundson et al 
take a further step in a desirable direction by bringing in information from outside 
the article being analyzed: words and terms are given greater topical value as the
contrast increases between the frequency of use within the article and the rarity of
general usage." 2/

"A further refinement of the process of automatic analysis would be the develop- 
ment of special sets of reference frequencies for special fields of interest. This 
would have two benefits; it would become possible to classify documents as to
field, and it would become possible to note the significance of words which are
frequent in the document and frequent in a very large reference class c0 of
literature (i.e., these words would not be significant with respect to c0) but which
are rare in the special field. For example, the word 'emotion' might be too
common in general usage to seem significant, but frequent occurrence of the word
would stand out in a paper on electronic circuitry (e.g., of a robot) when compared
with its frequency in general electrical engineering literature." 3/

"One of the ... goals is to investigate a relative-frequency approach to the
categorization of documents . . . For this investigation it will be necessary to develop
sets of reference frequencies for words used in different subject fields. It was
suggested by Edmundson and Wyllys that these sets of reference frequencies,
when developed, could be used to categorize a document as belonging to a particular
subject-field, by means of measuring the degree of matching (e.g., with the
chi-squared test) between the proportional frequencies of words in the documents and
the sets of reference frequencies." 4/

Two points in the comments quoted above appear especially worthy of note. The first 
is that of introducing at least some measure of reference to material other than the
individual author's own choice of linguistic expression and specific terms. We shall
discuss this factor in more detail in a later section of this report. The second point,
derived in part from the first, is the specific suggestion of movement away from purely 
derivative indexing by machine in the direction of automatic assignment indexing and 
automatic categorization or classification. 
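
The suggestion of matching a document against sets of reference frequencies by a chi-squared measure can be sketched as follows. This is a minimal illustration, not the procedure of any of the investigators cited: the two subject fields, their reference proportions, and the function names are hypothetical, and it is assumed that every reference proportion is greater than zero and that each set sums to one over a shared vocabulary.

```python
def chi_squared(doc_counts, reference_props):
    """Pearson chi-squared statistic of observed word counts against a reference set."""
    n = sum(doc_counts.get(word, 0) for word in reference_props)
    if n == 0:
        return float("inf")  # no vocabulary overlap, so no basis for a match
    stat = 0.0
    for word, proportion in reference_props.items():
        expected = n * proportion
        observed = doc_counts.get(word, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

def categorize(doc_counts, fields):
    """Assign the field whose reference frequencies give the smallest statistic."""
    return min(fields, key=lambda name: chi_squared(doc_counts, fields[name]))

# Hypothetical reference proportions over a shared four-word vocabulary.
fields = {
    "electronics": {"circuit": 0.5, "voltage": 0.3, "emotion": 0.1, "behavior": 0.1},
    "psychology":  {"circuit": 0.1, "voltage": 0.1, "emotion": 0.4, "behavior": 0.4},
}
doc_counts = {"circuit": 6, "voltage": 3, "emotion": 1}
best_field = categorize(doc_counts, fields)
```

A smaller statistic indicates a closer match, so the document is assigned to the field that minimizes it.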



1/ Doyle, 1959 [165], p. 9.

2/ Doyle, 1961 [169], p. 3.

3/ Edmundson and Wyllys, 1961 [181], p. 228.

4/ Wyllys, 1963 [653], p. 10.



Actual experiments in application of relative frequency techniques to automatic
extracting processes have been pursued since 1959 by various investigators. Edmundson
and Wyllys and Damerau (1963 [148]) were certainly among the first. Edmundson and
Bohnert were engaged in experimental investigations at Planning Research Corporation
in 1959, 1/ and the following year Edmundson, Oswald, and Wyllys worked on the
auto-indexing and auto-extracting of the 40,000 words of text contained in nine articles
in the subject field of missilery. 2/ Wyllys has continued work on relative frequencies
(1963 [653]). At the System Development Corporation Doyle, in some of his work, has also
explored the relative frequency approach (1961 [161]). An example in Europe is work
reported by Meyer-Uhlenried and Lustig, where significant keywords from abstracts are
used not only as indexing terms directly, but by means of keyword lists and
micro-thesauri can also be used to assign documents to specific subject fields (1963 [417]).

3.3.4 Significant Word Distances

Another technique that has been investigated for the improvement of automatic
extraction operations based on the statistics of word frequencies is that of distances between
significant words. The desirability of attaching greater weight to n-tuples of immediately
adjacent words and to the co-occurrences of words within the same sentence has been
mentioned previously. Savage, in relatively early work developing some of the initial
proposals of Luhn, considered intra-sentence distances between significant words as
follows:

"... The criterion is the relationship of the high-frequency words to each other, 
rather than their distribution over the whole sentence. Consequently, it seems 
reasonable to consider only those portions of sentences which are bracketed by 
high-frequency words and to set a limit for the distance at which any two such 
words shall be considered as being significantly related . . . An analysis of many 
sentences and many documents indicates that a useful limit is four or five
non-significant words between any two high-frequency words." 3/
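
The bracketing rule quoted above can be sketched in a few lines of Python. The sketch assumes that the set of high-frequency (significant) words has already been chosen by some other means; the function name and the sample sentence are hypothetical, and the default limit of four intervening words is taken from the lower figure in the quotation.

```python
def significant_clusters(words, significant, max_gap=4):
    """Return (start, end) index pairs for portions of a sentence bracketed by
    significant words, where neighboring significant words are separated by at
    most max_gap non-significant words."""
    positions = [i for i, word in enumerate(words) if word in significant]
    if not positions:
        return []
    clusters = []
    start = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:   # too many intervening words: close the cluster
            clusters.append((start, prev))
            start = pos
        prev = pos
    clusters.append((start, prev))
    return clusters

sentence = ("index terms are chosen when frequency and also the many "
            "other small function words intervene").split()
spans = significant_clusters(sentence, {"index", "frequency", "words"})
```

Here "index" and "frequency" are separated by four other words and fall in one cluster, while "words" lies seven words further on and stands alone.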

Doyle has also noted the tendency of words that are in fact highly related in a
content-revealing sense to co-occur in the same sentence or as quite direct neighbors. The same
investigator has also suggested that word distances can be used to provide "clustering"
effects that might, for example, sort out the possibly different topics covered in
introductory or background discussions, the main text, and various appendices. 4/



1/ National Science Foundation's CR&D Report No. 5, [430], p. 33; Bar-Hillel
1962 [35], p. 418.

2/ National Science Foundation's CR&D Report No. 6 [430], pp. 43-44.

3/ Savage 1958 [521], p. 4. Later related work has included a method for generating
auto-extracts which adds to the high-frequency word sentence scores a correction
factor for the number of words in gaps between such words. (See Rath et al, 1961
[493])

4/ Doyle 1961 [166], p. 7.

Related research efforts in more general areas of linguistic data processing suggest
inter-sentence distances as criteria for the selection of words and word groups in
automatic indexing and abstracting processes. In natural language text searching, for example,
the work of both Swanson (1960 [587], 1961 [588], 1963 [583]) and of Maron and Ray 1/
suggests that limitation of searching to a four-sentence span would eliminate a number of
irrelevant responses to search requests specifying the joint occurrence of two or more
words.

Swanson's findings indicated that if two words or phrases contained in the search
request were found in textual proximity within these limits, they were highly likely to bear
a semantic relationship that is what was intended by the requester. Applying the
four-sentence proximity criterion, it was found that the amount of irrelevant material retrieved
by the text searching system could be reduced by 60 percent without serious loss of
relevant information. 2/ Black cites the four-sentence proximity criterion and notes
further that it might be used also to retrieve only a paragraph or similar small portion of
the full text, reducing the amount of material to be read by the user, perhaps by as much
as 90 percent. 3/
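
The four-sentence proximity criterion can be illustrated with a minimal Python sketch. It assumes the text has already been divided into sentences and treats a "four-sentence span" as any four consecutive sentences; the function names and the sample text are hypothetical.

```python
import re

def sentence_hits(sentences, term):
    """Indices of the sentences in which the term occurs as a whole word."""
    term = term.lower()
    return [i for i, s in enumerate(sentences)
            if term in re.findall(r"[a-z]+", s.lower())]

def cooccur_within_span(sentences, term_a, term_b, span=4):
    """True if both terms occur somewhere within `span` consecutive sentences."""
    hits_a = sentence_hits(sentences, term_a)
    hits_b = sentence_hits(sentences, term_b)
    return any(abs(i - j) < span for i in hits_a for j in hits_b)

text = [
    "Neutron beams were used.",
    "The target was thin.",
    "Counters were calibrated.",
    "Decay rates were measured.",
    "Errors were small.",
    "Scattering was then observed.",
]
near = cooccur_within_span(text, "neutron", "decay")
far = cooccur_within_span(text, "neutron", "scattering")
```

A search request for the joint occurrence of two words would accept the first pair, which fall within four consecutive sentences, and reject the second, which lie six sentences apart.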

Artandi, in her book-indexing studies, suggested as a topic for further investigation
the possibility that proximity of index term candidates as derived from the same section
of the text could serve to improve the quality of the indexing. Since her computer program
checks for duplicate potential entries occurring on the same page, this feature could be
used for further analysis, on the assumption that the number of occurrences of the same
entry for the same page is an indication of the importance of the discussion of the subject
on that page. 4/

3.3.5 Uses of Special Clues for Selection

Intra- and inter-sentence distances between words are relatively crude examples of
clues to selection of words and word-pairs which, because of their implied relationships,
may be especially significant for indexing, sentence extraction, or document
categorization. They can be quite readily detected by machine, but the implication that physical
proximity is a good measure of significant co-occurrence is often false. Other clues
which can be detected equally well, mechanically, are those which have to do with position
and format.



1/ Ray, 1961 [494], p. 92.

2/ Swanson, 1963 [583], p. 9; 1961 [586], pp. 298-299.

3/ See Black, 1963 [64], p. 20 and footnote: "The figure 90 percent is derived from
experience in previous experiments, wherein the amount of relevant material
was scanned and a subjective judgment was formed that the relevant material was
actually about 10 percent of the total verbiage retrieved. That is, about 10 percent
of each document contained the relevant material; 90 percent of the document was
of no relevance but the document as a whole was relevant."

4/ Artandi, 1963 [20], p. 47.

Such obvious positional clues as occurrences of words in titles, chapter or section
headings, figure captions, have already been mentioned. To these can be added first and
last sentences of paragraphs, 1/ or first and last paragraphs as such. 2/ Wyllys
observes that other criteria which are detectable in the text by straightforward machine
procedures can be based on such features as italicization, capitalization, or punctuation.
He notes, however, that such "editorial" criteria vary from journal to journal so that
their usefulness would need to be related to the particular practices of individual
journals. 3/

Somewhat more difficult for machine implementation, but certainly feasible in the
present state of the programming art, is the use of specific semantic or syntactic clues.
Here again, Luhn, Baxendale, and Edmundson and Wyllys all anticipate their critics and
later investigators. Luhn recognized the fact that in at least some applications the
characterization of documents by isolated words alone would fail to provide an effective
degree of discrimination. He, therefore, suggested operations to establish word
relationships, whether based on co-occurrences or combinations of specific parts of
speech. 4/ Baxendale clearly uses both syntactic and semantic clues, detectable by
built-in table lookups.

Representative suggestions by Edmundson or Wyllys or both as co-authors include
the following:

" ... We have in mind a glossary or dictionary of perhaps one to two thousand 
words that act either as cue words which signal the importance of a sentence 
or as stigma words that signal the insignificance of a sentence for purposes of 
abstracting." 5/



1/ See, for example, Wyllys, 1963 [653], p. 27: "One of the first published studies
in automatic document-content analysis, that of Miss Phyllis Baxendale, brought
out the importance of the first and last sentences in a paragraph as bearers of
a good deal of the content of the paragraph." See also Marthaler, 1963 [399],
p. 25.

2/ Compare Swanson, 1963 [580], p. 1: ". . . Some evidence exists to show that for
short homogeneous articles title and first paragraph are nearly as good as full
text."

3/ Wyllys, 1963 [653], p. 28.

4/ Luhn, 1959 [384], p. 5.

5/ Edmundson, 1962 [178], p. 11.

"The criteria for attributing significance to words . . . may be positional (in virtue of
their occurrence in titles or section headings), or semantic (in virtue of their
relation to words like 'summary'), or perhaps even pragmatic (in the case of names
of specialists mentioned in text footnotes, or bibliography) . . .

"A cataloguer or abstract-writer would naturally give more weight to a technical
word that appears in a title, in a first paragraph, or in a summary. A machine
can be programmed to do the same. It can be instructed to recognize the title by
position and capitalization . . . It can place first-paragraph indications . . . It can
test every heading or subtitle for the words 'summary' or 'conclusions' and place
a summary indication after each word in the summary paragraphs." 1/

"The statistical criteria ... by no means exhaust the potential clues to the
representativeness of sentences. Among other plausible clues are certain words
and phrases . . . authors use words such as 'conclusion', 'demonstrate', 'disclose',
'prove', 'show', and 'summary' (and related forms of these) with high frequency in
sentences that contain concise statements about the topic or topics of the article . . .
The occurrence in a sentence of such a phrase as 'it was found that . . .', 'the
experiment proves . . .', or 'the central problem is . . .' would indicate probably
even more sharply than any single word could that the sentence was likely to be
highly representative of the topics . . ." 2/
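
The use of cue words, cue phrases, and stigma words as clues to sentence significance can be sketched as follows. The cue words and phrases are those named in the passage just quoted; the stigma words, the weights, the crude prefix matching used to catch "related forms", and the function name are all assumptions made for illustration.

```python
import re

# Prefixes standing in for the quoted cue words and their related forms.
CUE_STEMS = ("conclu", "demonstrat", "disclos", "prove", "proving", "show", "summar")
CUE_PHRASES = ("it was found that", "the experiment proves", "the central problem is")
STIGMA_WORDS = {"perhaps", "incidentally"}   # hypothetical examples only

def cue_score(sentence, phrase_weight=2):
    """Score a sentence: +1 per cue word, -1 per stigma word, extra for cue phrases."""
    text = sentence.lower()
    words = re.findall(r"[a-z]+", text)
    score = sum(1 for w in words if w.startswith(CUE_STEMS))
    score -= sum(1 for w in words if w in STIGMA_WORDS)
    score += phrase_weight * sum(1 for phrase in CUE_PHRASES if phrase in text)
    return score

high = cue_score("It was found that the method shows promise.")
low = cue_score("Incidentally, the weather was fine.")
```

Prefix matching is deliberately crude and will admit some unrelated words; a glossary of the size mentioned by Edmundson (one to two thousand words) would replace it in practice.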

3.3.6 Recent Examples of Mixed Systems Experimentation 



It is clear from the above samples of suggestions for the use of various
special clues for automatic extraction that improved systems will largely depend upon
a mixture of means for determining subject-representativeness of words, phrases, and
sentences. Many of the clues suggested by Edmundson and Wyllys are continuing to be
explored, as mixed systems, at RAND 3/ and the System Development Corporation (1962
[590]), for example. Two specific recent examples of mixed systems experimentation
are the automatic abstracting experiment programs at Thompson Ramo-Wooldridge and
the work involving detection of first incidences of nouns at the Harvard Computation
Laboratory.



The TRW programs to investigate possibilities of computer generation of document
auto-abstracts, involving both English and Russian language texts, are based upon a
combination of four different methods to measure significance and determine
representativeness. These four methods are briefly described as follows:



". . . The Key method has its source of machine recognizable clues the specific 
characteristics of the body of the document and is based on a Key Glossary of
content words taken from the body of the document.

1/ Edmundson and Wyllys, 1961 [181], pp. 227 and 229.

2/ Wyllys, 1963 [653], p. 25.

3/ See National Science Foundation's CR&D report No. 11, [430], pp. 314-315.

". . . The Cue method has as its source of machine recognizable clues, the general
characteristics of the corpus that are provided by the bodies of the documents and
is based on a Cue Dictionary of function words apt to appear in the body of a
document.

". . . The Title method has as its source of machine recognizable clues, the specific
characteristics of the skeleton of the document, i.e., title, headings, and format,
and is based on a Title Glossary comprising those content words found in the
title, subtitles, and headings, but excluding certain words of the Cue Dictionary.

". . . The Location method has as its source of machine recognizable clues, the
general characteristics of the corpus that are provided by the skeletons of the
documents and uses a Heading Dictionary of certain function words that appear
in the skeletons of documents." 1/
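
One way in which four such methods might be combined is a weighted sum of per-sentence scores, sketched below in Python. The source quoted above does not state how the TRW programs combine the methods, so the linear combination, the equal default weights, the glossary and dictionary contents, and the function names are all assumptions.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def score_sentence(text, heading, key_glossary, cue_dictionary, title_glossary,
                   heading_dictionary, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of Key, Cue, Title, and Location scores for one sentence."""
    words = tokenize(text)
    key = sum(1 for w in words if w in key_glossary)          # body content words
    cue = sum(cue_dictionary.get(w, 0) for w in words)        # signed cue weights
    title = sum(1 for w in words if w in title_glossary)      # title/heading words
    location = sum(heading_dictionary.get(w, 0) for w in tokenize(heading))
    w_key, w_cue, w_title, w_location = weights
    return w_key * key + w_cue * cue + w_title * title + w_location * location

# Hypothetical glossaries and dictionaries for a single document.
key_glossary = {"indexing", "computer"}
cue_dictionary = {"show": 1, "perhaps": -1}
title_glossary = {"indexing"}
heading_dictionary = {"conclusions": 2, "introduction": 1}

strong = score_sentence("Results show that indexing by computer is feasible.",
                        "Conclusions", key_glossary, cue_dictionary,
                        title_glossary, heading_dictionary)
weak = score_sentence("Perhaps the weather matters.", "Background",
                      key_glossary, cue_dictionary, title_glossary,
                      heading_dictionary)
```

Sentences would then be ranked by score and the highest-scoring ones extracted; the relative weights would have to be tuned against manually selected sentences.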

The Harvard work involving detection of the first incidences of nouns as sentence
selection and indexing clues is part of a larger-scale program for mechanized information
selection and retrieval under the general direction of Salton (1961 [512], 1962 [513],
1963 [514] and [515]). The specific mixed system involving frequency data, syntactic
identification clues, and positional criteria is primarily the result of investigations by
Lesk and Storm (1961 [577], 1962 [358]). Related work takes advantage of computer
techniques for predictive syntactic analysis and automatic dictionary lookup also under
development at the Harvard Computation Laboratory (Kuno and Oettinger, 1963 [339],
[340], [341]).

The Lesk-Storm experiments have involved investigations where the hypothesis
assumed is that the points in a text where the author has first introduced a specific noun
or nominal phrase, or where he has used, with higher frequencies, a combination of
first-referred-to nouns, are most likely to be especially indicative sections of text with
respect to subject-content representativeness. The assumption is further that areas in
which specific "new" ideas, not mentioned previously in the text, are first introduced are
particularly rich in topical-content concentration.

The mixed- system emphasis followed by Lesk and Storm, however, is revealed in 
the following comments: 

"It is not, of course, apparent that a count of initial occurrences of nouns ... is by 
itself sufficient to reveal areas of significant information content for purposes of 
abstracting or indexing. Accordingly, the method suggested here must be used 
together with other available means, and is not expected to provide by itself an 
acceptable abstracting algorithm." 3/

In their actual investigations, Lesk and Storm first made manual counts of initial 
noun occurrences in various sample texts, noting paragraph, sentence, and first 
incidence-of-word identifications. The computer was then used to carry out three
distinctive tasks: (1) calculation of the number of new nouns for each sentence in the text; 



1/ Thompson Ramo Wooldridge, 1963 [603], p. 1.

2/ Lesk and Storm, 1962 [358], p. 1-6.

3/ Storm, 1961 [577], pp. 1-1 and 1-2.

(2) computation of functions proportional to the number of initially occurring nouns for
each sentence, and (3) the preparation of a normalized graph for initial noun occurrences
by plotting the functional values against each sentence in the text. 1/ Sentence selection
can then proceed by processes to detect "peaks" on the graph, using a relative criterion
or weighting function to minimize the effect of high first-noun counts in the beginning
sentences of a paper.

Trials were made with a number of different weighting formulas, and the best of these
involved the obtaining of moving averages of first-noun counts over several adjacent
sentences. A particular formula covering a span of seven sentences gave results that
appear to emphasize contextual effects and to reduce the effects of a particular single
sentence with a large number of new nouns, such as a listing of proper names. The
resulting abstracts are quite lengthy (e.g., comprising 20 percent or more of the original
text), and contain some relatively uninformative sentences. The investigators think that
the results with respect to satisfactory abstracting are inconclusive but provocative. They
also conclude that the possibilities for indexing are more immediately promising: "Most
key definitions are retained in the successful summaries, and the vocabulary reflects the
topics covered in the texts." 2/
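
The three steps of counting new nouns, smoothing the counts, and locating peaks can be sketched in Python as follows. The particular seven-sentence weighting formula used by Lesk and Storm is not given above, so an unweighted centered moving average is assumed here; the identification of nouns is assumed to have been done beforehand, and all function names are hypothetical.

```python
def new_noun_counts(sentence_nouns):
    """For each sentence (a list of its nouns), count nouns not seen earlier."""
    seen = set()
    counts = []
    for nouns in sentence_nouns:
        fresh = set(nouns) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts

def moving_average(values, span=7):
    """Unweighted centered moving average; windows are truncated at the text ends."""
    half = span // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def peaks(values):
    """Indices of interior local maxima of the smoothed curve."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > values[i - 1] and values[i] >= values[i + 1]]

counts = new_noun_counts([["index", "term"], ["term", "machine"], [], ["index"]])
smoothed = moving_average([3, 0, 0], span=3)
peak_positions = peaks([0, 2, 1, 3, 3, 0])
```

Sentences at or near the peak positions would be the candidates for extraction. Averaging over adjacent sentences damps the effect of a single sentence that merely lists many new names.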

Other examples of mixed-system experimentation, especially involving the use of
syntactic and semantic considerations, include the work at the General Electric Computer
Department under Spangler, and work by Jacobson and Plath. In the Phoenix laboratories
of General Electric, a KWIC type indexing program can be applied both to titles and to
running text and a contemplated extension is intended to "generate indexes by means of
word analysis, taking into consideration syntactic and semantic aspects of text lines". 3/
Jacobson describes rules for machine determinations of same-meaning occurrences of
words which may be homographic and for selection of descriptors for indexing simple
paragraphs by choosing words occurring at least twice with a high probability of having the
same meaning. 4/ Plath reports:



"Although sentences occur in which the key term or phrase lies buried 
deep down in the structure, preliminary observations indicate that there 
are many others in which the semantic hierarchy closely parallels that 
of the syntactic structure. This suggests that more sensitive vocabulary 
statistics for purposes of automatic abstracting may be obtainable by 
considering only words occurring in positions above a predetermined
cut-off level in the sentence structure. Alternatively, one might count
occurrences of words on each level, and then multiply by a fixed
weighting factor in each instance before taking the overall totals." 5/
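
Both alternatives mentioned by Plath, a cut-off level and a fixed weighting factor per level, can be sketched in one small Python function. The sketch assumes that a syntactic analysis has already assigned each word a depth in the sentence structure (0 for the top level); the function name, the sample depths, and the weights are hypothetical.

```python
from collections import defaultdict

def level_weighted_counts(tagged_words, level_weights=None, cutoff=None):
    """Total word occurrences, optionally ignoring words below a cut-off depth
    and optionally multiplying each occurrence by a weight for its depth."""
    totals = defaultdict(float)
    for word, depth in tagged_words:
        if cutoff is not None and depth > cutoff:
            continue   # word lies too deep in the structure to be counted
        weight = 1.0 if level_weights is None else level_weights.get(depth, 0.0)
        totals[word] += weight
    return dict(totals)

tagged = [("indexing", 0), ("machine", 1), ("indexing", 2), ("cost", 3)]
by_cutoff = level_weighted_counts(tagged, cutoff=1)
by_weight = level_weighted_counts(tagged, level_weights={0: 1.0, 1: 0.5, 2: 0.25})
```

The resulting totals would take the place of raw frequency counts in whatever statistical selection procedure is being used.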



1/ Lesk and Storm, 1962 [358], pp. 1-2, I- 1 * ff.

2/ Ibid, p. 1-31.

3/ National Science Foundation's CR&D Report No. 11, [430], p. 21.

4/ Jacobson, 1963 [292], pp. 191-192.

5/ Plath, 1962 [474], p. 190.

3.4 Quality of Modified Derivative Indexing by Machine

Most of the modified derivative indexing techniques that have been proposed to date
have few or no indexing results to provide comparative data for purposes of evaluation.
Moreover, those techniques which are primarily directed to the generation of document
abstracts rather than indexing terms have been reported to date with a paucity of actual
examples. 1/ One of the main reasons for this lack of product-effectiveness data is
unquestionably the high cost and difficulty of obtaining substantial corpora of representative
document text in machine-readable form. For the most part, the few examples of
automatic abstracts produced by machine are sadly lacking in pertinency, relevancy, 2/
and in continuity for scanning or reading by comparison with conventional human abstracts,
whether prepared by author, editor, volunteer specialist in the subject field, or
professional documentalist.

A few studies have been made of somewhat larger numbers of examples of
"auto-abstracts" with respect to differences between several different machine-extraction
formulas, random sentence selections, and sentences extracted manually. A project
conducted by IBM's Advanced Systems Development Division for the ACSI-matic program
(1960 [289], 1961 [290]) involved 70 to 90 articles on military intelligence items. The
comparisons were of "auto-abstracts" as against titles, full texts,
"pseudo-auto-abstracts" comprised of the first and last 5 percent of the sentences of each text, and
sets of sentences selected randomly, without reference to conventional types of manually
prepared abstracts and without respect to the quality as such. Similarly, Thompson
Ramo Wooldridge data (1963 [601]) on machine-extracted and randomly-extracted
sentence sets compare these "abstracts" against manual selection of 25 percent of the
sentences of each item, rather than against a conventional type of abstract.

There are, however, almost no data available on the possible results of using sentence
and word-group extracting techniques, applied to machine-usable texts, for the
development of indexing entries rather than for the generation of substitutes for document
abstracts. For this reason, as well as because discussion of the difficulties of evaluation
in general will be deferred to a later section of this report, the question of the quality of
modified derivative indexing will be briefly considered below, largely in terms of
non-quantitative judgments.

First and foremost, as has been noted previously, is the objection that word-indexing 
typically produces redundancy, scatter of references among synonyms and near-synonyms,
inclusion of many irrelevant entries at high page and user-scanning costs, omission of



1/ Purto expresses regret that the studies of Agrayev and Borodin, intercomparing
results of human abstracting, use of Luhn's method, and their own modification,
used only a single paper (1962 [484]). Storm (1961 [577]), evaluating the initial
noun occurrence technique as a measure of sentence and index-term extraction
significance, reports results for only two papers, both by Quine. Only nine
articles, with no more than 40,000 words of text in toto, were used by Edmundson,
Oswald and Wyllys in their 1960 experiments ([180]).

2/ Compare, for example, Lesk and Storm, 1962 [358], pp. 1-29 and 1-30 as follows:
"A final problem is the ambiguity that may arise by removing two sentences from
context; two sentences alone do not always permit comprehension. Worse yet, the
meaning may actually be inverted upon removal from context. For example . . . a
quote is selected which an unsuspecting reader might think the author supports,
when he is really attacking the position."

many properly indexable topics or points of interest because the authors did not emphasize
them or used new and unusual terminology to describe them, failures to achieve
consistency both of reference and index-vocabulary control for the papers of more than one
author, and the like.



Additional difficulties are engendered, for word indexing by machine from text as
against word indexing by people, because of complexities required in programming to
achieve recognition of even such simple indicia as endings of sentences, 1/
inconsistencies of capitalization, 2/ and misspellings. 3/ Context distinctions between multiple
meanings of homographic words are even more difficult. Difficulties in achieving good
indexing quality are increased if only titles are used; those of keystroking and machine
cost requirements increase as the amount of input material grows.

For these reasons, early criticisms such as those of Bar-Hillel are largely as 
pertinent today as they were when statistical techniques for computer generation of 
document extracts and index terms were first proposed. For example: 



"There can be no doubt but that computers are in a position to select out of the 
words or word- strings occurring in the encoded form of the original document 
those words or strings which fulfill certain formal, statistical conditions, such 
as occurring more than five times, occurring with a relative frequency at least 
double the relative frequency in general . . . However, it is . . . unlikely that the
set obtained thereby will be of a quality commensurate with that obtained by a
competent indexer. First, there will be serious difficulties as to what is to be
regarded as instances of the same word . . . Second, there arises . . . the problem
of synonyms. Third, and most important, this procedure will yield at its best a
set of words and word strings exclusively taken from the document itself." 4/



On the other hand, there are many situations where, because of time factors or lack 
of conventional indexing resources, even unmodified derivative indexing by machine is 
itself of value and therefore modifications to improve the quality of results, whether 
made by man or by machine, may be well worthwhile. As Anzlowar claims: "The
increasingly widespread KWIC indexes ... can save so much in time and effort that they
surely deserve better than the somewhat haphazard 'slash-dash-ing' now done in most
instances as the only cerebral operations thereon." 5/



1/ See Luhn, 1959 [384], p. 22: "Amongst the difficulties encountered in the processing
of machine readable texts, inconsistencies in the use of punctuation marks,
compounds, capitals, spacing and indentations have been a problem way out of
proportion with respect to the simple functions these devices stand for. For instance,
even with the aid of a dozen different tests performed by the machine, the true end
of a sentence cannot be determined with certainty."

2/ See Artandi, 1963 [20], pp. 52ff, on problems of capitalization of proper names.

3/ See Wyllys, 1963 [653], p. 15.

4/ Bar-Hillel, 1962 [35], pp. 417-418.

5/ Anzlowar, 1963 [16], p. 104.






Modifications to derivative indexing techniques that tend toward normalizations of 
terminology and word usage, and increasingly sophisticated proposals for machine use 
of syntactic, semantic, and contextual clues hold out the promise of transition to more 
truly "subject" indexing and to automatic assignment indexing systems. 

4. AUTOMATIC ASSIGNMENT INDEXING TECHNIQUES 

Whether indexing by machine is possible depends in part on whether what can be
achieved by machine is properly termed "indexing". If "indexing" is defined as being more
than the mere extraction of words from titles, abstracts, or text, then automatic
derivative indexing, even when augmented by various modifications, normalizations, and
editings, does not provide affirmative evidence. In the case of concept-oriented
definitions of indexing, the question becomes one of whether or not automatic assignment
indexing is possible. Experimental evidence suggesting that it is will be presented in this
section.

We should note first, however, that just as there are differences of opinion as to
what "indexing" means, so there are similar differences with respect to whether or not
it represents concepts rather than extracted words. There are also a number of
conflicting definitions of what is meant by "indexing" in contradistinction to "classifying". For
some, the latter difference is related to questions of the number of labels or surrogates
assigned to a single item to represent its subject contents, ranging from the assignment
of a single subject category in a classification scheme involving mutually exclusive
classes to the assignment of a number of terms or descriptors, each standing for one of a
number of aspects of the subject. For our purposes, however, we shall regard both the
case of indexing with a number of descriptors and that of classifying to a single category
or subject heading as being within the province of automatic assignment indexing,
reserving the term "automatic classification" for the case where the machine is used to
establish the classification or categorization scheme itself.

Actual experiments in automatic assignment indexing by Borko, Borko and Bernick,
Maron, Salton, Stevens and Urban, Swanson, and Williams will be discussed briefly
below. These discussions are generally in chronological order with respect to first
reporting of results, except that the Salton-Lesk-Storm work reflects a somewhat
different principle of assignment from the methods using clue word approaches and it is
therefore described after these others have been discussed. Some of the similarities and
differences between the various methods are then indicated. A brief final subsection
covers related assignment indexing proposals for which experimental data is not available
or has not as yet been reported in the literature.

4.1 Swanson and Later Work at Thompson Ramo-Wooldridge

Research on fully automatic indexing as well as on full text searching and retrieval 
at the Ramo-Wooldridge Corporation has been reported as being under way at least as
early as the spring of 1958. 1/ As described elsewhere in this report, experiments in 
search and retrieval based upon full natural language text had used as test items short 
articles in the field of nuclear physics. In additional experiments representing a 
preliminary "clue word" approach to possibilities for automatic indexing procedures, 
some of this same material was used. 



1/ National Science Foundation's CR&D rept. no. 2 [430], p. 32.

In these additional experiments, 27 articles in the nuclear physics subject area were
included in a corpus of 100 articles, the remainder covering a variety of topics.
Frequency counts of word occurrences for the physics material were obtained and the 12
most frequent words that were judged to be discriminatory for the subject were selected.
The hypothesis was then tested, that if any document pertained to nuclear physics it would
contain at least two of these words. Retrieval was achieved for 25 of the 27 documents
and the two "irrelevant" documents also retrieved did include information at least
peripherally related to the subject. It was thus evident that the retrieval effectiveness
of automatic recognition of nuclear physics subject material in the general collection was
considerably greater than the average effectiveness of retrieving responses to the highly
specific search questions in nuclear physics that had been used in the full text searching
experiments (Swanson, 1961 [586]).
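
The "at least two clue words" hypothesis can be illustrated with a minimal Python
sketch. The clue words shown are hypothetical (the 12 words actually selected are not
listed here), and the function names and the choice to count distinct words rather than
repeated occurrences are assumptions made for illustration.

```python
import re

# Hypothetical clue words: the report does not list the 12 words actually selected.
CLUE_WORDS = {"neutron", "proton", "nucleus", "fission", "isotope", "scattering"}


def matched_clue_words(text, clue_words=CLUE_WORDS):
    """Return the set of distinct clue words that occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & set(clue_words)


def pertains_to_subject(text, clue_words=CLUE_WORDS, min_distinct=2):
    """Apply the 'at least two clue words' test to a single document.

    Counting distinct words (not repeated occurrences) is an assumption.
    """
    return len(matched_clue_words(text, clue_words)) >= min_distinct
```

A document mentioning both "neutron" and "nucleus" would pass the test under these
assumptions, whereas one mentioning only "neutron" would not. Note that exact-match
lookup of this kind misses inflected forms such as plurals.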

This second set of experiments provided a transition from the full text searching
work, which if it can be considered indexing at all is obviously derivative indexing, to
work in the application of an automatic assignment indexing method to 1,200 newspaper
clippings (Swanson, 1962 [584], 1963 [580]). These were brief news items for which
machine-readable texts in the form of punched paper tape were available. Thesaurus-groups
of words likely to be associated with each of 20 to 24 subject headings were first
compiled on the basis of human analysis of 1,000 or more representative items. These
word groups were further screened so that no word appeared in more than one group and
so that each word retained should be uniquely indicative of the particular subject
category. In the machine assignment procedure, subsequently, if a word occurs that
belongs to a particular thesaurus group, the corresponding subject heading is assigned
to the item in which that word occurs.
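
A minimal sketch of this lookup procedure follows. The headings and word groups are
invented for illustration (the actual thesaurus groups are not reproduced in this
report), and the function names are hypothetical. The sketch enforces the screening
rule that no word may appear in more than one group.

```python
import re

# Hypothetical thesaurus groups; each clue word must belong to exactly one heading.
THESAURUS_GROUPS = {
    "SPORTS": {"inning", "touchdown", "referee"},
    "WEATHER": {"rainfall", "blizzard", "forecast"},
    "FINANCE": {"dividend", "stockholder", "bond"},
}


def build_lookup(groups):
    """Invert heading -> words into word -> heading, rejecting shared words."""
    lookup = {}
    for heading, words in groups.items():
        for word in words:
            if word in lookup:
                raise ValueError(f"{word!r} appears in more than one group")
            lookup[word] = heading
    return lookup


def assign_headings(text, lookup):
    """Assign every heading for which at least one clue word occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({lookup[token] for token in tokens if token in lookup})
```

Because a single matching word is enough to trigger an assignment, an item can receive
several headings, which is consistent with the multiple assignments per item reported
for these tests.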

Results achieved with this technique appear to be highly promising, at least for this
type of material. Swanson reports as follows:

"Approximately 1,200 brief news items were classified into 20 nonhierarchical
subject categories, both by a human and a machine procedure. Each item was
assigned on the average to about four categories. The results of the two
processes were compared. With the human process as a standard, the machine
missed only seven percent of the correct subject assignments and made a number
of irrelevant assignments equal to about 17 percent of the total. Nearly 40
percent of the automatic subject assignments judged finally to be correct were
missed by the human catalogers." 1/

While this accomplishment is actually due to the extensive human effort devoted to
compiling, organizing, and pruning the uniquely indicative word lists, it is pointed out
that this intellectual effort and the programming tasks need to be done only "once and
for all". 2/ It is further pointed out that garbles or misspellings in the input text do
not appear to affect the procedure, there being enough redundancy in the messages so that
even if one or two clue words are missed, others will be present. 3/



1/ Swanson, 1962 [584], p. 468.

2/ Ibid., p. 469.

3/ Swanson, 1963 [580], p. 5.

Swanson and his TRW associates have further proposed extensions of the prespecified
unique clue-word technique. For example, it is suggested that machine processes of
comparing words of titles, subtitles and chapter headings to lists of possible subject
headings can be extended in sophistication by machine lookups of synonym groups and of
characteristic subject-word associations. 1/ Frequency weightings may be taken into
account, and similar measures of association and subject-indicativeness may be
developed for phrases as well as for individual words. 2/ In general, however, the
apparent success of this clue-word technique in tests to date should be considered in the
light of the special character of the items, their extreme brevity, and the high probability
that the fact-word incidence involved in news reporting is not typical of less popular and
less factually oriented materials. 3/

Continuing work along similar lines has been carried forward at Ramo-Wooldridge in
the "Word Correlation and Automatic Indexing Program" sponsored by the Council on
Library Resources (1959 [490] and [491]). Here, the objectives are to develop and apply
clue-word techniques to material that is much more representative of the scientific and
technical literature. The thesaurus-groups, now called "indexonym" groups, are made up
of words and phrases selected by extensive human analysis as being significantly
"useful-for-retrieval-purposes".

New items would be processed in a word and phrase lookup operation, with each word
or phrase being initially assigned the identifier number codes of all groups to which it
belongs. However, unless a particular group's number is repeated several times within
the space of a few paragraphs, it is not used as the basis for the actual assignment of an
index tag. Provision would be made for calling human attention to items having a number
of words that are not deleted by processing against a "useless-for-retrieval purposes"
list, but that are not found in any of the "accepted" groups. It is suggested that in this
way it should be possible to "ascribe measures of automatically recognizable 'newness' to
technical articles". 4/
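
The proposed procedure might be sketched as follows. This is an illustrative reading
only: the window of three paragraphs, the repetition threshold, the cutoff for calling
human attention, and all names are assumptions, since the proposal speaks only of
"several times" within "a few paragraphs".

```python
import re
from collections import Counter


def index_with_indexonyms(paragraphs, groups_of, useless,
                          min_repeats=3, window=3, max_unknown=5):
    """Sketch of indexonym-group tagging with a repetition threshold.

    groups_of: word -> set of group identifier codes (a word may be in several).
    useless:   words on the "useless-for-retrieval purposes" list.
    All numeric thresholds are hypothetical.
    """
    tokenized = [re.findall(r"[a-z]+", p.lower()) for p in paragraphs]

    tags = set()
    for start in range(len(tokenized)):
        # Count group codes over a sliding window of consecutive paragraphs.
        counts = Counter()
        for tokens in tokenized[start:start + window]:
            for token in tokens:
                counts.update(groups_of.get(token, ()))
        tags.update(code for code, n in counts.items() if n >= min_repeats)

    # Words that survive the useless-word list but match no accepted group.
    unknown = {token for tokens in tokenized for token in tokens
               if token not in useless and token not in groups_of}
    needs_review = len(unknown) >= max_unknown
    return sorted(tags), sorted(unknown), needs_review
```

The set of unmatched words gives one crude, hypothetical stand-in for the "newness"
measure mentioned in the proposal.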

4.2 Maron's Automatic Indexing Experiments 

By April of 1959, the reports of work at Thompson Ramo-Wooldridge on automatic
indexing and related problems submitted for the Current Research and Development in
Scientific Documentation series included reference to Maron and a "probabilistic model for
the assignment of index tags", as well as to Swanson's continuing projects. 5/



1/ Swanson, 1962 [584], p. 469.

2/ Swanson, 1963 [580], pp. 1-2.

3/ See also Mooers, 1963 [424].

4/ Thompson Ramo Wooldridge, 1959 [491], p. 2A.

5/ National Science Foundation's CR&D report No. 5 [430], p. 34.

In addition to his work on probabilistic indexing with emphasis on relevance
weightings for index tags manually assigned, Maron has actively explored automatic
assignment indexing techniques. The approach is also probabilistic, with emphasis on
the statistics of association between content-indicative clue words and subject headings
manually assigned to sample documents. The experimental corpus consisted of a group
of abstracts in the field of computer technology indexed to 32 subject categories designed
for the purposes of these investigations.

Common words such as articles and prepositions were first excluded. Next, words
occurring less than three times were purged and words such as "data" and "computer"
were also rejected because they occur so frequently in this literature. Approximately
1,000 words remained after these purging operations. After sorting the source documents
to their most appropriate subject categories, statistical frequencies were
obtained for the co-occurrences of the candidate clue-words with the categories and the
resulting listings were manually examined to determine which words peaked in a
particular category. Eventually, 90 such words were selected.

The occurrence of one or more of the 90 clue-words in the text of new documents was
then used to predict the subject category to which the new item should belong. 1/ Tests
were run with two groups of documents, one consisting of the source items from which
the statistical frequency and word list data had been obtained, and the second group
consisting of 145 genuinely new items. For the latter group, twenty documents contained
no clue words whatever and forty items had only one. For the remaining 85 items having
two or more clue words, the results of the computer assignment program were predictions
of the correct category in 44, or 51.8 percent, of the cases. 2/ Results using the
source documents were significantly better, as expected, with 84.6 percent accuracy of
category prediction for 247 items. Results were also related to the number of clue words
that occurred in the test items, with a prediction accuracy of only 48.7 percent for items
with a single clue word rising to 100 percent probability of correct assignment if six or
more clue words occurred.
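
Maron's equation is not reproduced in this report, so the following Python sketch should
be read only as one plausible form of a Bayesian clue-word predictor: it assumes a
product of per-category word probabilities weighted by the category's prior frequency,
with add-one smoothing. The smoothing, the function names, and the data layout are all
assumptions.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: iterable of (clue_words_present, category) pairs from source items."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, category in samples:
        category_counts[category] += 1
        word_counts[category].update(set(words))
    return category_counts, word_counts


def predict(words, clue_words, category_counts, word_counts):
    """Return the most probable category, or None if no clue word is present."""
    present = set(words) & set(clue_words)
    if not present:
        return None  # items without clue words cannot be assigned
    total = sum(category_counts.values())
    best_category, best_score = None, float("-inf")
    for category, n_docs in category_counts.items():
        score = math.log(n_docs / total)  # prior probability of the category
        for word in present:
            # Add-one smoothing (an assumption) avoids log(0) for unseen pairs.
            score += math.log((word_counts[category][word] + 1) / (n_docs + 2))
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

The `None` result corresponds to the items in the tests that contained no clue words
whatever and therefore could not be indexed.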

Trachtenberg (1963 [608]) has also considered a probabilistic approach to automatic 
indexing and categorization of documents, similar to that of Maron. He suggests the 
investigation of two information theoretic measures with reference to determination of 
which of various possible clue words are significantly discriminating with respect to the 
different categories. He further suggests experiments using 90 clue words and the 
corpus used by both Maron and Borko, but no actual results have as yet been reported. 

4.3 Automatic Indexing Investigations of Borko and Bernick

At the System Development Corporation, the work of Borko (1960 [73]), and of
Borko and Bernick (1962 [77], 1963 [78], 1964 [79]) in the area of automatic indexing
has involved both automatic assignment indexing and automatic classification techniques.
They have not only reported actual indexing results but have provided data for the
intercomparison of their techniques with the experiments of Maron for the same source
material.



1/ Note that the word itself is not necessarily used as an index tag or label, as is the
case for derivative indexing using an inclusion list approach. This is an important
distinction.

2/ Maron, 1961 [395], p. 257.



The original Borko approach was based on the principles of factor analysis as these
had been developed for the analysis of multivariate data, especially in the field of
psychology. Borko's first experiments were directed to a corpus consisting of 618
abstracts in the field of psychology, amounting to approximately 50,000 words of total
text and 6,800 different words. These words were sorted by computer program into an
order reflecting their respective frequencies of occurrence. From the approximately 200
words that occurred twenty or more times in this corpus, the investigator himself
selected 90 words to serve as index (or, better, index-clue) terms. A matrix was then
developed for the frequencies of co-occurrence of these words and the documents in which
they appeared. From this, a 90 x 90 correlation matrix was computed as follows:

"To compute the correlation coefficient ... we used the following formula

\[
r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left[N\sum x^{2} - (\sum x)^{2}\right]\left[N\sum y^{2} - (\sum y)^{2}\right]}}
\]

Where N is equal to the number of documents (618) and x and y are the terms being
correlated." 1/
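
The product-moment formula quoted above can be computed directly from per-document term
frequencies. The sketch below is illustrative: the function names, the dictionary layout
of the term-by-document data, and the convention of returning 0.0 for a term with zero
variance are assumptions.

```python
import math


def correlation(x, y):
    """Product-moment correlation of two equal-length frequency lists."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        return 0.0  # assumption: treat a constant term as uncorrelated
    return (n * sum_xy - sum_x * sum_y) / denominator


def correlation_matrix(term_doc):
    """term_doc: term -> list of frequencies, one entry per document."""
    terms = sorted(term_doc)
    matrix = [[correlation(term_doc[a], term_doc[b]) for b in terms] for a in terms]
    return terms, matrix
```

With 90 index terms and 618 documents, `correlation_matrix` would yield the 90 x 90
matrix that is subsequently factor analyzed.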

The term-correlation matrix was then factor analyzed and the first ten eigenvectors
were selected as factors to be rotated and interpreted. Borko emphasizes that:

"The interpretation must be made by the investigator and is based upon his knowledge
of the analytic procedures and the subject matter. There is, therefore, a degree of
subjectivity in the names selected for each factor. These names may be regarded
as hypotheses about the factor meaning." 2/

Following the derivation of these "classification categories" by means of the factor
analysis technique, new items may be assigned to the categories on the basis of words
occurring in their texts (abstracts) in accordance with the following procedural steps:

"1. Each document, in machine readable form, is analyzed by the computer.
A list of the index terms and their frequencies of occurrence in each document
is recorded.

"2. The category or categories containing the index term is assigned a value equal
to the product of the number of occurrences of the word in the abstract and the
normalized factor loading of the word in the category. If more than one index term
appears in a category, the products are summed.

"3. After each index term has been considered, the category having the highest
numerical value is selected." 3/
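
The three steps quoted above amount to a weighted sum per category followed by a
maximum. A minimal Python sketch follows; the loadings and category names in the usage
note are invented, and the handling of items with no index terms (returning `None`) is
an assumption.

```python
def assign_category(term_counts, loadings):
    """Score each category and select the highest.

    term_counts: index term -> number of occurrences in the abstract (step 1).
    loadings:    category -> {index term: normalized factor loading}.
    """
    scores = {}
    for category, term_loadings in loadings.items():
        # Step 2: sum of (occurrences x normalized loading) over the category's terms.
        scores[category] = sum(term_counts.get(term, 0) * loading
                               for term, loading in term_loadings.items())
    has_index_term = any(term_counts.get(term, 0) > 0
                         for term_loadings in loadings.values()
                         for term in term_loadings)
    if not has_index_term:
        return None, scores  # assumption: no index terms means no assignment
    # Step 3: the category with the highest numerical value is selected.
    return max(scores, key=scores.get), scores
```

For example, with hypothetical loadings `{"LEARNING": {"learning": 0.8, "memory": 0.5},
"PERCEPTION": {"visual": 0.9, "memory": 0.2}}` and an abstract containing "memory" twice
and "visual" once, LEARNING scores 2 x 0.5 = 1.0 and PERCEPTION scores
0.9 + 2 x 0.2 = 1.3, so PERCEPTION is selected.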



1/ Borko, 1961 [73], p. 283.

2/ Ibid., pp. 285-286.

3/ Borko and Bernick, 1962 [77], pp. 7-8.

The choice of 90 clue words in Borko's work with abstracts in the field of psychological
literature was apparently dictated by a matrix size which would be convenient
for computer manipulation. 1/ However, it happened to coincide with the number of clue
words used by Maron in his experiments. Advantage was taken of this coincidence to
obtain comparative data on the performance of the two assignment-indexing techniques
as applied to the same material. The 260 computer literature abstracts used by Maron
as source documents were processed to derive a correlation matrix for Maron's 90
manually selected words, which was then factor analyzed. Several sets of factors were
extracted, rotated, and the results studied, with a final selection of 21 categories.



Since these automatically derived categories did not coincide with Maron's original
32, it was necessary to analyze manually the total group of 405 abstracts (260 "source"
and 145 "test" items) and assign them to the new categories, then to study the documents
falling into each factor-analytically derived category to determine which of Maron's 90
clue words were category-indicative, and finally to substitute these words in the Bayesian
equation used by Maron so as to predict which of these classification categories his
probabilistic method should obtain.



The same two sets of 260 "source" and 145 "new" abstracts used by Maron were then
submitted to the computer assignment program which compares the clue words of a new
item with the numeric values of the predictor words for each factor category, then
computes the score for each item in all categories, and assigns the category with the
highest score to the item. For the source items, Borko and Bernick's results showed 63.4
percent correctly classified, by comparison with the 84.6 percent correctness score
originally obtained for them in Maron's experiments. For the new items the factor
analysis method scored 48.9 percent correct assignment by comparison with Maron's
original 51.8 percent. 2/ The later investigators therefore concede that the performance
of Maron's technique was somewhat superior for the same items using the clue words
originally selected by Maron.

Further experimentation was then carried out (Borko and Bernick, 1963 [78]) using
word frequency data for the selection of a new set of 90 clue words and a classification
scheme for 21 categories was again automatically derived. The 405 abstracts were again
manually classified to these machine-derived categories by five subject-matter
specialists and the two investigators. Comparative data were then obtained for both the
Maron assignment formula and the modified classification system assignments in terms
of agreement with the manual assignments.

For the source items, the percentage of machine assignments agreeing with those
made by people was 62.7 when the Bayesian probability formula used by Maron was
applied and 61.2 for the factor analysis score system. For the new items, the
corresponding correct percentages were 57.9 and 55.9. Additional data compared the
effects of using the original Maron words and the frequency-based word set (Borko's
words) for the same probability formula assignment method. While there was an overlap
of approximately 50 percent between Maron's words and Borko's words, the findings
indicated that:



1/ Now increased to 150 x 150.

2/ Borko and Bernick, 1962 [77], pp. 9-10.

"... The index words selected by Maron are decidedly specific to the documents
from which they were derived and are of less generality than the frequency based
terms. The Bayesian formula coupled with the Maron words correctly predicted
the classification of 7?.6% of the documents in Group I ['source items'] but only
45.5% of the documents in Group II ['test items']. The coupling of the Bayesian
formula with the Borko words resulted in a slight decrease in the percentage of
Group I documents whose classification was correctly predicted (62.7%) but
increased the percentage of correct prediction for Group II documents to 58.0%." 1/

Other findings from the later experiments indicated that despite the differences in
the two word sets, the factor categories derived from them were very similar. It was
also found that, at least for the source items (Group I), the two machine techniques and
the manual process classified 56.1 percent of the items into the same categories. It
should be noted, however, that in the case of the automatic assignment methods: "Eleven
documents contained no clue words and could not be automatically classified by either
system." 2/

4.4 Williams’ Discriminant Analysis Method 

The work of Williams in automatic assignment indexing, reported in the fall of
1963 [642], has also involved tests on abstracts of the computer literature, directly
comparable to but not necessarily identical with those used by Maron and by Borko and
Bernick. This work at IBM's Federal Systems Division, Bethesda, is based in part on
earlier work by Meadow which involved computer studies of matching functions for
document word lists and category word lists for test items drawn from such fields as
psychology, law, computer abstracts, and news items. 3/ What has subsequently been
developed is termed a "discriminant" method which begins with a hierarchical
classification structure of pre-established subject categories and with a small set of
sample documents previously indexed by people into these categories. Frequency counts of
words in each of the sample documents lead to computations, for each category, of the
theoretically probable frequencies of its most statistically significant words. For new
items, observed word frequencies are compared with the theoretical word-category
associations and a relevance value is computed for the item in terms of each category.

The corpus selected for experimentation consisted of 400 items from "Computer
Abstracts on Cards". 4/ These had previously been indexed using a classification
structure of 15 major categories, each of which is divided in turn into 10 subcategories.
The experimental sample, however, was so selected as to provide exactly 15 "source"
items and 5 "new" items for each of 5 subdivisions of 4 of these major categories.



1/ Borko and Bernick, 1963 [78], p. 23.

2/ Ibid., p. 11.

3/ Williams, 1963 [642], cites H. R. Meadow, "Statistical Analysis and Classification
of Documents", IRAD Task No. 0353, FSD IBM, Rockville, Maryland, 1962, but
this is apparently a company-confidential document, containing proprietary
information. Meadow gave an informal report on her work at the Computing Center
seminars, University of Maryland, in March of 1963.

4/ Available on a subscription basis from Cambridge Communications Corporation,
Cambridge, Mass.

Discriminant coefficients were then computed at both the major and minor levels for
all words occurring in the sample items falling into one of the 20 groups in accordance
with the formula:

"The discriminant coefficient is:

    [equation not legible in the source copy]

Where: ... m ...

\[
P_{ij} = f_{ij} \Big/ \sum \ldots
\]

The relative frequency of the ith word in the jth category.

and ..." 1/

These coefficients are used both to set up threshold values to determine which words
should be used in the assignment formulas and to assign weighting factors to the words
themselves.

1/ Williams, 1963 [642], p. 163.

The results of the experiments to date are based on 83 items from the "reference
set" which were not used as source items. For 63 items, 78 percent were correctly
classified at the level of a single major category (e.g., "Programming", "Hardware
Design") and 64 percent were also correctly classified at a single subcategory level
(e.g., "Programming Languages", "Semiconductor Devices"). The 20 remaining items were
classified to one major category with an accuracy of 95 percent and to two minor level
subdivisions with accuracies of 60 percent and 75 percent. Additional investigations were
made on the effects of using a discrimination threshold to eliminate insignificant words
from consideration and on the use of weighting factors in the assignment calculations.

4.5 SADSACT

Stevens and Urban at the National Bureau of Standards (1963 [569, 570]) have also
explored an automatic indexing technique that uses, as in the experiments of Williams,
a teaching sample or reference set of previously indexed items to form patterns of word
and index-term assignment associations. However, there are much less formal requirements
for computing correlation coefficients and no consideration is required of either
the theoretical probabilities of word occurrence by category or of discrimination
coefficients and thresholds. Instead, the technique involves ad hoc statistical
associations between the words occurring in the title and in the abstract of a sample item
and the descriptors previously assigned to that item. A master selection-word vocabulary
is thus built up where each word is listed in terms of the frequencies of its co-occurrence
with each of the descriptors with which it has co-occurred, regardless of whether or not
such prior associations are either relevant or significant. No attempt has as yet been
made to "purge" the resulting association lists. Instead, reliance is placed on the
patterns of multiple word usage and of redundancy of words used in titles and cited titles
of new items to minimize the effects of irrelevant or accidental prior word-descriptor
associations and to enhance the significant ones.

The SADSACT method (for "Self-Assigned Descriptors from Self and Cited Titles")
proceeds with the assumption, which it shares with the arguments for citation indexing
previously discussed, that the literature references cited by an author are indicative of
the subject content or contents of his paper. 1/ For the automatic indexing of new items,
their titles and the titles of up to ten bibliographic references cited are keystroked,
converted to punched cards, and fed to the computer. This input material is run against the
master vocabulary to obtain for each input word which matches a vocabulary word a
"descriptor-selection score" for each of the descriptors previously associated with that
word. These scores are summed up for all words and at an appropriate cutting level
those descriptors having the highest scores are assigned to the new item.
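
The following Python sketch illustrates the general shape of such a procedure. It
assumes that the descriptor-selection score is simply the raw co-occurrence frequency
from the teaching sample, which is a simplification adopted here for illustration and
is not confirmed by the text; the function names and sample data are likewise
hypothetical.

```python
import re
from collections import Counter, defaultdict


def build_vocabulary(teaching_sample):
    """teaching_sample: iterable of (title_and_abstract_text, descriptors).

    Returns word -> Counter mapping each descriptor to its co-occurrence frequency.
    No purging of irrelevant associations is attempted.
    """
    vocabulary = defaultdict(Counter)
    for text, descriptors in teaching_sample:
        for word in set(re.findall(r"[a-z]+", text.lower())):
            vocabulary[word].update(descriptors)
    return vocabulary


def assign_descriptors(titles, vocabulary, max_descriptors=3):
    """Sum the scores over all words of the title and cited titles of a new item."""
    scores = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word in vocabulary:
                scores.update(vocabulary[word])
    # The cutting level is modeled here as a fixed number of top-scoring descriptors.
    return [descriptor for descriptor, _ in scores.most_common(max_descriptors)]
```

Because every matching word contributes, including common words, the sketch relies on
redundancy across titles and cited titles to outweigh accidental associations, in the
spirit of the approach described above.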

Preliminary results based on the titles and cited titles of items that were "source
items" in the sense that their titles and abstracts had been used in the teaching sample
were reported at the NATO Advanced Study Institute on Automatic Document Analysis
held in Venice in July, 1963. For 30 items drawn from such subject fields as computer
technology, information selection and retrieval, mathematical logic, pattern recognition,
and operations research, all of which had previously been indexed by ASTIA personnel in
1960, the machine assigned 64.8 percent of the descriptors previously assigned.
Subsequent tests on genuinely new items, however, resulted in a drop to only 48.2 percent
"hit" accuracy.

These "new" item results were also evaluated by having several representative
users of the collection analyze the test items and assign descriptors to them from a list
of the descriptors available to the machine. The extent to which the descriptors assigned
by machine were also independently chosen by one or more of these indexers was then
checked. In general, the fewer descriptors assigned by the machine, the better was the
human agreement, ranging from 47.4 percent overall in the case where the machine had
assigned twelve descriptors to each item to 76 percent agreement where the machine
assigned only one. In particular, for ten items which were analyzed by five different
indexers, the chances that one or more would also select the machine's first choice
(highest scoring) descriptor averaged 90 percent.

1/ If necessary or desirable, however, abstracts or portions of text can be used in
addition to or in lieu of the cited titles.

4.6 Assignment Indexing from Citation Data

Certain phases in the program of investigation of information selection and retrieval
problems at the Harvard Computation Laboratory have been mentioned previously. The
work of Storm and of Lesk and Storm on the use of first-noun-occurrences as selection
clues for both automatic indexing and abstracting was discussed in connection with
techniques for improved derivative indexing. The studies on citation indexing have
included, as noted, experiments to assign indexing terms to a new document by finding the
indexing terms previously assigned to the five most "related" documents, where
"relatedness" is a function of the similarity in citation patterns as between the new
document and items already in the collection. The results of such index term assignments
are reported as identical to those made by human judgment approximately 50 percent of the
time. 1/

More specifically, in an experiment using documents drawn from a small collection
in the fields of mathematical linguistics and machine translation, a new item was
compared in terms of its citation data with the citation similarity data previously
determined for earlier documents, and the set of five related documents was selected using
the magnitude of the row similarity coefficients obtained from links of length one and two.
All index terms occurring at least twice in the set of terms assigned to these related
items were then assigned to the new items. For the ten "typical" new item cases, for
which comparative data are shown, the citation data assignment method correctly
assigned, on average, 47.6 percent of the terms assigned manually to the same items. 2/
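
The selection rule just described can be sketched in a few lines of Python. The
similarity coefficients are taken as given inputs (their computation from citation links
of length one and two is not shown), and the function name and data layout are
assumptions.

```python
from collections import Counter


def assign_terms(similarity, index_terms, n_related=5, min_votes=2):
    """Assign terms shared by at least `min_votes` of the most related documents.

    similarity:  earlier document id -> similarity coefficient with the new item.
    index_terms: earlier document id -> set of index terms assigned to it.
    """
    related = sorted(similarity, key=similarity.get, reverse=True)[:n_related]
    votes = Counter(term for doc in related for term in index_terms.get(doc, ()))
    return sorted(term for term, count in votes.items() if count >= min_votes)
```

With the defaults, a term carried by only one of the five most related documents is not
assigned, however similar that document may be to the new item.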

A slightly more sophisticated indexing term assignment formula, described by Lesk,
was applied to additional test cases, but "failed to raise accuracy above fifty percent". 3/
For five typical new cases, the improved method correctly assigned 11 of the 20 terms
manually assigned to these items, or an average accuracy of 55.5 percent. 4/

4.7 Similarities and Distinctions among Assignment Indexing Experiments

In Table 2 some of the key points of the various automatic assignment indexing
experiments we have discussed above are summarized. Certain similarities, distinctions,
and differences are to be noted. Borko and Bernick use the same corpus as did Maron
and also re-apply Maron's formula to a different clue-word set for the same material.
Williams uses material similar to the Maron-Borko computer corpus. The SADSACT
tests also use some items that might be included in the Maron-Borko and Williams
corpora. The Swanson experiments with newspaper clippings represent a quite different
class of material consisting of brief, terse, factual messages.



1/ Lesk, 1963 [357], p. V-8.

2/ Salton, 1962 [520], p. III-41, Table 9.

3/ Lesk, 1963 [357], p. V-7.

4/ Ibid., p. V-8, Table 3.

Table 2. Summary of Automatic Assignment Indexing Test Evaluations

Investigator: Maron

  Principles and Methods: Statistical probabilities of association between clue
  words and pre-established subject categories. Source items manually indexed to
  32 categories. A subclass of words occurring in the corpus selected as clue
  words, and statistical correlations obtained for 90 such words with categories
  assigned. Correlation data and Bayesian probabilities used to assign categories
  to new items.

  Materials Used: Corpus of 405 items selected from computer abstracts, PGEC, 1959.
  Full text, 20,000 words, of which 3,263 were different words.

  Tests: For 260 source items, 12 did not contain any clue words, 247 were indexed,
  1 contained an error preventing processing. For the 247 source items indexed,
  probability of top-ranked category being correct = 84.6%. For 145 new items,
  20 not indexed because they contained no clue words. In 85 cases where at least
  2 clue words occurred, probability of correct category assignment = 51.8%.

  Remarks: Considerable manual inspection and judgment involved in the selection
  of clue words. Some new items cannot be processed, because they contain no clue
  words.

Investigator: Borko

  Principles and Methods: Factor analysis to determine distinctive grouping of clue
  words. Word frequency counts made, 90 of the 200 most frequent non-common words
  manually selected. Correlation matrix computed, factors rotated and interpreted.

  Materials Used: Psychological abstracts. 618 abstracts, 50,000 text words;
  6,800 different words.

  Tests: Factors selected were judged to be compatible with but not identical to
  subject classification terms used for these items by the American Psychological
  Association.

  Remarks: Some new items cannot be processed, because they contain no clue words.

Investigator: Borko and Bernick

  Principles and Methods: Factor analysis to determine distinctive groupings of
  clue words. Maron's 90 clue words used for word-word correlation and factor
  analysis. 21 factors developed, and items manually re-indexed to these
  categories.

  Materials Used: Same corpus as Maron, 405 computer abstracts, of which 260 used
  to establish factors, 145 as new items.

  Tests: Detailed comparison with Maron's technique. For the source items, 63.4%
  were correctly classified. For the new items, 46.5% correctly indexed, and 48.9%
  were correct for those items in which 2 or more clue words occurred.

  Remarks: Some items cannot be processed because they contain no clue words.

Investigator: Swanson

  Principles and Methods: Text word lookup against clue word lists, constructed by
  careful analysis of sample items to be exclusively indicative of a particular
  subject heading. Machine assigns a subject heading to an item if any word on its
  list occurs in that item.

  Materials Used: Brief news dispatches available on teletype tape, wide diversity
  of topics. From study of several 1,000 items, 24 subject headings established
  and word lists selected, averaging approximately one hundred per category.
  775 new items then tested.

  Tests: Machine assignments compared to manual subject indexing. For a first batch
  of 500 items, 569 assignments of correct headings, 119 assignments of irrelevant
  headings, and 32 correct headings missed. The clue word thesaurus was then
  revised. For 275 additional test items, results showed 282 correct assignments,
  29 irrelevant assignments, 1 missed. For total, averages of 17% irrelevant
  assignments, 3% missed. For 200 items, machine and manual assignments were
  compared with respect to 5 of the subject categories, with the following results:

                   Man    Machine
      Irrelevant     4       25
      Missed        46        4
      Correct       75      116

Stevens and 
Urban 


— 

Teaching sample for machine
compilation of co-occurrence
data for words in titles and
abstracts with descriptors
assigned to these items.
Words in titles and cited
titles of new items then run
against master list of pre-
vious word-descriptor assoc-
iations to derive descriptor-
selection scores, highest
scoring descriptors (e.g.,
up to 12) assigned. Assoc-
iations derived for 1,600
words co-occurring with
any of 70 descriptors pre-
viously assigned.


Two teaching 
samples, ap- 
proximately 
100 items each 
with 70% over-
lap, drawn
from items in-
dexed by ASTIA.
For new items,
titles and up to
10 cited titles.


For 59 test items, assignments of
descriptors that had occurred for
at least 3% of the sample items
agreed with ASTIA assignments
58.1%. However, for all des-
criptors assigned by ASTIA, many
not available to machine, overall
machine accuracy = 40.1%. For
20 items, independently evaluated
by several typical users, the
chances that one or more people
would agree with the machine
assignments ranged from 47.1%
when 12 descriptors were assigned
to 75.0% average agreement with
the machine's first choice.


All test items could be 
processed and up to 12 
different descriptors 
assigned to each, but 
some descriptors used 
in manual indexing of 
these items are not 
available to the 
machine. 


Williams


Discriminant analysis. 
Sample items previously 
indexed to a 2-level clas- 
sification system were 
subjected to word fre- 
quency counts and the 
theoretical frequencies of 
the most significant words 
in each category were com-
piled. For new items, ob-
served word frequencies
compared with theoretical
frequencies for each cate-
gory, highest scoring
assigned.


Items from 
"Computer 
Abstracts on 
Cards" index- 
ed to 15 major 
categories each 
divided into 10 
minor catego-
ries. 300 ab-
stracts selected
to provide equal
distribution to 20
sub-categories,
5 each in 4 major
categories. Add-
itional items for
test similarly
selected.


For 63 new items assigned by 
machine to 1 major and 1 minor 
category, 78% correct at major 
level, 64% correct at minor level. 
For 20 items classified to 1 major 
and 2 minor categories, 95% cor-
rect at major level, 60% and 75%
correct at the minor level.





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None of the experiments has so far encompassed testing of anything but very small 
test item samples and the dangers of extrapolating from so small and so specialized 
bodies of data should be clearly recognized. Mooers identifies these dangers in terms of

"The Silent Postulate:

         (real people)
    That (real documents)     can somehow be eliminated from the experimental study
         (real jobs to do)

             (role-playing people)
    and that (substitute documents)    can be substituted and still give valid experimental results." 1/
             (imaginary jobs)



In most of the experiments in automatic indexing conducted to date, indexing and 
classification schedules have been especially designed, or evaluations made, specifically 
for the purposes of these tests. Williams, however, stresses the point that the material 
used in his experiments had been "classified by professional indexers for the purposes of
actual retrieval." 2/ A similar claim can be made for SADSACT, as noted by Mooers. 3/
Swanson's news item work also obviously relates to real items and implies a real job
to be done, but is directed, as noted, to a class of material not generally comparable to 
that found in documentation operations on scientific and technical literature. 



In contrast with the treatment of each document as a self-contained entity without 
reference to any other documents, as is the case for derivative indexing, all of the 
automatic assignment indexing experiments, by virtue of the fact that they are assign- 
ment techniques, do to some extent embody the effects of a consensus of a particular 
collection, or a consensus of prior indexing, or a consensus of human subject content 
analysis applied to sample documents, or some combination of these effects. The SAD- 
SACT method, in addition, wherever cited titles are available for new items, takes 
advantage of terminology other than the author's own as a source of clue words. Other 
proposed methods of assignment indexing, such as the use by Salton, Lesk, and Storm of 
citation-pattern similarity data, would carry the latter principle even further. 



1/ Mooers, 1963 [424], p. 5.

2/ Williams, 1963 [642], p. 162.

3/ Ibid, p. 5.
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4. 8 Other Assignment Indexing Proposals 

A few additional automatic assignment indexing proposals are under development.
Examples for which experimental data are not as yet generally available include
work at EURATOM, some preliminary experiments at Chemical Abstracts
Service, work at General Electric, Bethesda, the proposed "Multilindex" system of
Information Systems, Inc., investigations by Slamecka and Zunde, and a special purpose
development project at Goodyear Aerospace.

Meyer-Uhlenried and Lustig report for the EURATOM developments as follows:

". . . Procedures are being developed which allow based upon given keyword
lists first for abstracts: (a) to assign significant keywords and (b) based
upon hierarchically organized keyword lists, to assign the documents in
question to specific subject fields.

"Experiments were made at first on narrow fields with so-called micro-
thesauri, they showed encouraging results when automatic and manual assign-
ment were compared. Positive results depend of course on the quality of the
abstracts and the significance of the words employed in them. It remains to
see how far this favorable prognosis is confirmed by keyword collections of
more complex contents." 1/

Friedman and Dyson (1961 [203]) have reported on manual experiments designed to 
relate words occurring in a sample of abstracts from a particular section of Chemical 
Abstracts to the title or heading for that section. Significant words in these abstracts 
were counted and the number of occurrences as well as the number of different abstracts 
in which they appeared were determined, with a rank order listing as a result. It 
appeared, from inspection, that it should be feasible to develop, for each CA section, a
relatively small vocabulary of words that would be descriptive, and indicative of, the
subject matter contained in it. They conclude: "In our opinion, the results were signifi-
cant, the small vocabulary of words did select a large percentage of the abstracts in the
section it was based on." 2/
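
The counting step described above (occurrences of each word, and the number of different abstracts in which it appears, arranged in rank order) can be sketched as follows. Friedman and Dyson performed it by hand; the function name `rank_words`, the tokenizing rule, and the stop-word parameter are illustrative assumptions rather than details of their procedure.

```python
import re
from collections import Counter


def rank_words(abstracts, stop_words=frozenset()):
    """Return (word, occurrences, abstracts_containing) tuples in rank order.

    Hypothetical helper: words are lower-cased alphabetic strings, and any
    word in `stop_words` is treated as non-significant and skipped.
    """
    occurrences = Counter()      # total count of each word over all abstracts
    abstract_count = Counter()   # number of different abstracts containing it
    for text in abstracts:
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in stop_words]
        occurrences.update(words)
        abstract_count.update(set(words))
    # Rank by occurrences, then by number of abstracts, then alphabetically.
    ranked = sorted(occurrences,
                    key=lambda w: (-occurrences[w], -abstract_count[w], w))
    return [(w, occurrences[w], abstract_count[w]) for w in ranked]
```

The head of such a list would be a candidate for the small section vocabulary; how many words to retain is left to inspection, as in the original study.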

A project at Information Systems Operations, General Electric, on possibilities 
for automatic indexing and abstracting of text has been reported in the November 1962 
issue of Current Research and Development. 3/ The META project (Methods of Extracting
Text Automatically) is said to be concerned with the use of statistical, linguistic, and 
semantic criteria for analysis and selection of significant words and significant sentences 
from text. Computer programs are being developed in modular fashion for the GE-225 
computer. 



1/ Meyer-Uhlenried and Lustig, 1963 [417], p. 229.

2/ Friedman and Dyson, 1961 [203], p. 10.

3/ National Science Foundation's CR&D report, No. 11 [430], p. 97.
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The proposed "Multilindex" system is also based on micro-thesauri or small 
vocabularies designed, by human analysis, for clue-indications to a relatively narrow
subject field, together with potential syntactic-semantic role indications built into the
dictionary, again by extensive human analysis, following the approaches previously taken
by A. L. (Lukjanow) Loewenthal in her suggestions for solutions to problems of mecha-
nized translation. An unpublished proposal-type brochure describing the system was
available as of December 1963. 1/ As of that date, also, demonstration printouts were
available from an IBM 1401 Fortran program, illustrating an index compiled from
abstract-text input and a 1,200-word dictionary for documents in the field of space an-
tenna tracking radar. A repertoire of 350 "concepts" or indexing terms was involved,
with an average of 10 assigned to each of 22 test documents, many of these assigned terms being
identical to words occurring in either the title or the text of the abstract of the item.

Slamecka and Zunde have investigated the extent to which the "notations-of-content"
in the system developed by Documentation, Inc. for NASA's STAR might be derived by
machine techniques from the text of the abstracts with enough normalization-standardi-
zation via inclusion dictionary lookup to qualify as an assignment indexing technique.

These workers claim: 

"This preliminary investigation indicates the possibility of using the computer 
to index documents adequately for machine retrieval by matching their abstracts 
against an authoritative subject-heading authority . . . The inconsistency inherent in 
human indexing can be eliminated as the number of terms derived from any one 
abstract will always be the same. The abstract and its automatically derived set of 
index terms will always be equivalent. . ." 3/

A final example of other approaches to automatic assignment indexing research, not 
yet reported in the open literature, is an NIH sponsored project at Goodyear Aerospace, in
cooperation with the Universities of Minnesota and Rochester and Western Reserve
University, looking toward an automatic classification procedure based on word co-occur-
rences for a set consisting of 100 four-to-five page documents in the field of diabetes
literature. Programs for statistical analyses of the full text of these documents, all of
which have previously been processed for the manual W. R. U. "telegraphic" abstracting
system, are being developed. 4/

5. AUTOMATIC CLASSIFICATION AND CATEGORIZATION 

In all the experimental work, to date, that has been directed toward the use of 
computers and other machine-like techniques for the automatic indexing of documents, a



1/ "Description of MULTILINDEX. A mechanized system for indexing documents,
storing information, retrieving information", P. S. Shane, Dec. 4, 1963, In-
formation Systems, Inc., 7720 Wisconsin Avenue, Bethesda, Maryland.

2/ Private communications, A. L. Loewenthal and P. S. Shane, Dec. 11, 1963.

3/ Slamecka and Zunde, 1963 [561], pp. 139-140.

4/ E. Tuttle, private communication, Oct. 30, 1963.
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dichotomy can be observed. There is, on the one hand, a spate of examples of automatic 
derivative indexing where words used by the author himself or by human analysis are 
sorted and arranged, by machine, to provide index listings, announcement bulletins, and 
current awareness distribution notices. There are also, on the other hand, at least a 
few instances of investigations where the machine assigns category labels, indexing 
terms, or "heads" and "headings" from a classification schedule, to new items. 



In general, as Needham 1/ points out, proposed automatic assignment indexing pro-
cedures can be investigated with reference to a previously existing index term vocabulary,
an existing classification system or schedule, or to specially designed vocabularies and 
subject heading lists. On the other hand, if it is not known how well existing systems do 
in fact characterize documents and if it is not known whether all pertinent properties of 
the documents have been consistently identified, then it may be preferable to develop 
methods for assigning documents to the appropriate class in a classification system which 
is itself set up automatically. 2/ Needham also suggests still a third possibility: that of
setting up automatically a classification within which the subsequent classifying of docu- 
ments is done by hand. 



The principal experimental results, to date, of attempts to achieve automatic
classification of documentary items, especially in the sense of machine-generated
groupings or categorizations of such items, have been those of applying techniques of
"clumping", 3/ factor analysis, and "latent class analysis". We shall briefly consider
below some typical investigations into automatic classification or categorization proce-
dures that have already had, or may have, applicability in automatic indexing techniques.

In the late 1950's, Tanimoto undertook theoretical studies of mathematical
approaches to problems of classification and prediction with special reference to matrix
manipulations of sets of attributes of items to be classified. 5/ He also investigated

1/ Needham, 1963 [432], p. 1.

2/ Ibid, pp. 1-2: "If we are to assign a document to a class automatically, we must
have a) a list of facts about the classes which will make ascription possible:
b) an algorithm, usually some sort of matching algorithm, to tell us which class
best suits a document. Given a classification like the U.D.C., it is not at all
obvious that a) and b) exist, or even, if they can be found, a) and b) imply a degree
of uniformity about the classification which may just not be there."

3/ That is, the clustering of objects that are in some sense similar because they
share certain attributes or properties, even if, and especially when, the identity
of cluster-producing common properties is not known in advance.

4/ Compare Doyle, 1963 [162], p. 13: "There are other statistical techniques besides
factor analysis whose output is document clusters, such as latent class analysis
and clump theory, and there is a surprising increase in research in this kind of
analysis just within the last two years."

5/ Tanimoto, 1958 [593], 1961 [594]. See also Borko, 1963 [76], pp. 4-5: "In
1958, Tanimoto published a theoretical paper on the applications of mathematics to
the problems of classification and prediction. Specifically, he pointed out how the
problems of classification can be formulated in terms of sets of attributes and
manipulated as matrix functions."



theoretical aspects of automatic indexing and sentence extraction involving co-occurrences 
of words. While Tanimoto's studies with respect to linguistic information processing for 
classification purposes have apparently been limited to the theoretical considerations, 
similar concepts of probabilistic, computational, and matrix manipulative operations to 
derive and use coefficients of correlation of associations between such attributes as words 
occurring in text or the index terms assigned to documents are involved in the factor 
analysis and theory of clumps techniques as applied in actual experiments in documentary 
classification. 

5. 1 Factor Analysis 

The factor analysis technique which seeks to derive from word associations in 
representative documents an automatically generated classification schedule for use in 
actual indexing experiments has previously been mentioned. Reasons suggested for its 
use in research at SDC have been reported as follows: 

"The development of automatic procedures for purposes of classification and ab- 
stracting requires the identification and specification of attributes of words or 
passages so that the relevancy of topics or content can be determined. Auto-
matic procedures to detect such attributes may be based on a number of
characteristics of the text: word frequencies, syntactical information, semantic
information and pragmatic contextual clues. Currently, word frequency informa-
tion can be generated and manipulated by automatic procedures, whereas the
other attributes are not as readily handled this way. However, a correlation
matrix of content words becomes very unwieldy because of its size and the com- 
plexity of relationships. For this reason, factor analysis is used to identify 
clusters of relationships. Current work concentrates primarily on determining 
the usefulness of factors identified in this way as classification and indexing 
schemes. " 2/ 

As noted above, Borko and Bernick (1961 [73], 1962 [77], 1963 [78]) have applied 
this technique to abstracts drawn from psychological literature and to the same computer 
literature abstracts as had been used by Maron, (1961 [395]). This technique had also 
been investigated in the studies looking toward information retrieval classification and 
grouping undertaken at the Cambridge Language Research Unit from about 1957 onward. 
However, certain apparent limitations of the factor analysis approach led Parker-Rhodes 
and Needham to the alternative of the "theory of clumps" (1960 [465], 1961 [435, 464]).
Parker-Rhodes gives the rationale, and some of the distinctions between the two tech- 
niques, as follows: 

"It has been assumed that statistical methods could be applied to the data in such 
a way as to reveal any objectively existing classes which may be there. The general 



1/ Pp. 94-97 of this report.

2/ System Development Corporation, 1962 [590], p. 15.






name for the techniques evolved in this way is factor analysis. Insofar as it
is practically applicable this technique has worked well enough; but. . .it has two
limitations (a) that some classification problems are outside its scope, and
(b) that it is not susceptible (at least as hitherto conceived) of adaptation com-
putationally to the study of really large universes. . ." 1/

". . . The procedure of factor analysis first finds certain clumps, and then, as
output, it gives us vectors relating the descriptors of the universe to the
clumps found. . .

"In most cases, factor analysis is used (especially in psychology) to debug the
descriptor space; more conventionally put, to eliminate those tests (descriptors)
which have an equivocal membership in several factors (clumps) in favor of
those which, having more definite allegiances, convey more information of the
kind which the analysis suggests as valuable. It is thus only related to the
classification of the universe at one remove; the classification it suggests is a
simple categorical classification defined by the descriptors suggested as the
most valuable. . .

"The descriptive array of a universe is a table giving the applicability or
inapplicability of each descriptor to each element. To classify the elements
of the universe, we calculate for every pair of elements a similarity as a
function of the corresponding rows of the descriptive array, and then regard
the similarity matrix as a sufficient description of the universe. In factor
analysis, on the contrary, we start with the matrix of correlations between
the descriptors, each being a function of a pair of columns of the descriptive
array. . ." 2/

Other investigators who have considered factor analysis techniques for possible 
applications to automatic indexing, automatic categorization of items in a collection of
items, or search prescription renegotiation in a mechanized selection and retrieval
system include Stiles (1962 [573]), Doyle (1963 [162]), and Hammond (1962 [251]).

Stiles, whose principal experimental results relate rather to the use of statistical
associations between terms manually assigned to documents for search prescription
formulation and renegotiation than to automatic indexing procedures as such, 3/ has also
considered both automatic indexing and automatic classification approaches. Specifi-
cally, he has made at least preliminary investigations of the factor analysis technique
independently developed for similar purposes by Borko. For a large collection of
105,000 items, the statistics of co-occurrence of indexing terms were in some cases not
as precise as desired because the same terms were used in different senses for different 
items in the collection. 



1/ Note that Borko himself confirms this limitation as recently as November 1963,
in stating, of the CLRU work on clumps: "However, even now these techniques
have been applied to a 346x346 matrix which is beyond the capabilities of presently
available factor analysis programs." (1963 [76], p. 8).

2/ Parker-Rhodes, 1961 [464], pp. 3-6.

3/ This principal concern is discussed below with reference to potentially
related research, pp. 119-122 of this report.
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The possibilities of using factor analysis to sort out the different meanings were 
therefore explored. 1/ Using an IBM 704 program, the centroid method of factor analysis 
was applied to a matrix of correlation coefficients of terms that had co-occurred signifi- 
cantly with the term "exposure". Three factors were derived, one generally relating to 
the corrosive effects of exposure, another to "exposure" in the sense of photographic 
exposure, and the third dealing with both exposure-to-weather and exposure-to-radiation.
Although the results were considered quite satisfactory, more extensive experimentation
and use did not appear feasible because of computer matrix manipulation limitations.

Doyle notes, in particular, that factor analysis might be used to give well-defined 
clusters separated one from another by clear boundaries rather than the less precise 
clusters found by most document grouping techniques. He emphasizes, however, that 
"its success in doing so of course, depends on the well-defined clusters actually being 
present in the data". He suggests that a combination of factor analysis and human 
editing to select items most typical of statistically derived categories could be valuable 
in such applications as the sorting of Congressional mail or the identification of trends
in political or military intelligence materials free from the personal biases of an analyst. 

Hammond and his Datatrol associates who have worked on an application of the 
Stiles association factor technique for search question negotiation to legal literature have 
also considered the potentialities of factor analysis. Thus they report:

". . . The present association factor gives the relationship of one term to another.
A factor analysis study would allow us to determine the relationship of a single
term to a group of terms. From this we could learn how terms cluster when
related to the same concept." 2/

5. 2 The Theory of Clumps 

It is assumed, in the work on the theory of clumps, that we have a population of 
objects or items among which at least some classes or groupings do objectively exist, 
but that we do not have any bases for precisely determining class membership require- 
ments. There may, therefore, be many possible ways of grouping and many possible 
definitions of clumps. On the other hand, such diverse definitions must conform to the 
extent of some similarities of membership in the clumps that they define if in fact they 
do define any of the existing classes. Assuming further that we are given information 
about properties ascribable to various members of the population, it is theorized that 
useful clumps can be discovered by investigating similarity connections between pairs 
of items, such as the number of co-occurrences of specific properties. Thereafter, only 
these similarity connections are considered, and the connection matrix is used as the 
basis for trial partitions of the population into various possible subsets. 
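
As a minimal sketch of the connection matrix just described, the following counts, for every pair of items, the properties they share. The count of co-occurring properties is only one of the similarity measures the text allows; the name `connection_matrix` and the dictionary-of-sets input format are illustrative assumptions.

```python
from itertools import combinations


def connection_matrix(items):
    """Build a symmetric connection matrix from item -> set-of-properties.

    The connection between two items is taken here as the number of
    properties they share (an assumed similarity measure). The connection
    of an item to itself is left at zero.
    """
    names = sorted(items)
    conn = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        shared = len(items[a] & items[b])
        conn[a][b] = shared
        conn[b][a] = shared
    return conn
```

Once the matrix is formed, the original property lists are no longer consulted; trial partitions are judged from the connections alone.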
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In early work on clump definition, Kuhns of Ramo- Wooldridge ]J proposed the use 
of a threshold value such that if a subset is a clump every pair of members in it has a 
connection strength equal to or greater than the threshold value and no member of the 
subset's complement has connections of more than threshold value to the members of the 
subset. In the more extensive investigations carried out by Parker-Rhodes and Needham
(1960 [465], 1961 [434, 435, 464]), other clump definitions have been explored and
specifically that of the "GR-Clump". This is defined as a subset of the universe such
that all its members have a positive (or zero) bias to the subset and all non-members
have a negative bias to it, where bias is defined as the excess (positive or negative) of the
total connections of a member of the population to the members of the subset over its
total connections to the members of the subset's complement, following the convention 
that the connection of the element to itself is taken as zero. 

An iterative procedure for discovering GR-clumps can now be followed. This is
based on an arbitrary initial partition of the given universe of elements into a subset and 
its complement. Then, since each element has a bias toward both the subset and its 
complement, differing only in sign, the biases of each element are computed. If the bias 
of a particular element is positive with respect to the subset, it is transferred to the sub- 
set if it is not already a member of it, and conversely if its bias is negative, it is trans- 
ferred to the subset's complement if it is not already there. Each time a transfer is 
made, the biases are recomputed and the process is repeated until for a complete scan of 
all elements no further transfers can be made. The result is a GR-clump even though it
may have no members or may contain all the elements of the universe. In such case, a 
further partition is made and the procedures are re-applied. 
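
The iterative procedure above can be sketched as follows, assuming a symmetric connection matrix held as a dictionary of dictionaries. The names `bias` and `find_gr_clump` and the `max_scans` safeguard are illustrative. The text does not say what is done with an element whose bias is exactly zero; the sketch treats it as belonging to the subset, following the GR-clump definition, and this is an assumption.

```python
def bias(conn, element, subset):
    """Excess of an element's connections to the subset over its
    connections to the subset's complement (self-connection ignored)."""
    inside = sum(conn[element][other] for other in subset
                 if other != element)
    outside = sum(conn[element][other] for other in conn
                  if other not in subset and other != element)
    return inside - outside


def find_gr_clump(conn, initial_subset, max_scans=1000):
    """Transfer elements between a subset and its complement until a
    complete scan produces no transfer; return the resulting subset."""
    subset = set(initial_subset)
    for _ in range(max_scans):        # safeguard only (an assumption)
        transferred = False
        for element in sorted(conn):
            # Bias is recomputed against the current subset, so every
            # transfer is reflected in the tests that follow it.
            b = bias(conn, element, subset)
            if b >= 0 and element not in subset:   # zero bias: assumed member
                subset.add(element)
                transferred = True
            elif b < 0 and element in subset:
                subset.remove(element)
                transferred = True
        if not transferred:
            return subset
    raise RuntimeError("no stable partition reached within max_scans")
```

An empty result, or one containing the whole universe, is the degenerate case noted in the text, and calls for a fresh starting partition.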

These GR-clump finding procedures have been applied to such diverse collections
of items to be classified as archaeological artefacts and patients' symptoms as related
to specific disease diagnosis. In the latter case, groupings were obtained that corre-
sponded satisfactorily to certain specific disease syndromes, but no group was found
corresponding to Hodgkin's disease where a great variety of symptoms typically occur.
Needham comments: "I can scarcely conceive of a clump definition that would be likely
to group these patients; I am unsure whether this is a reflection on clump theory or on
Hodgkin's disease." 2/

In applications more directly related to documentation, some investigations have 
been made of the use of co-occurrence coefficients of index terms assigned to documents 
in order to form a connection matrix from which clumps were then derived (Needham, 
1963 [431]). These experiments covered 342 terms occurring more than once in the 
index- term sets assigned to several hundred documents in the general subject field of 
machine translation. Computation of the matrix required 20 minutes of computer time 
and the 40 clumps found took 6-8 minutes each to find. Needham reports on the results
as follows: 
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"Evaluation of the results was unexpectedly difficult. The acid test is presumably 
the efficiency of the retrieval system embodying the grouping given by the program; 
but the efficiency of retrieval systems cannot be easily measured. An apparently 
simpler test would be to see if the clumps were intuitively satisfactory, i. e. , were 
groupings that a classifier in his right mind could have made. This also was un- 
satisfactory because the groups are mostly rather large, larger in fact than 
classifiers ordinarily make, and were thus very difficult to judge. The test 
eventually adopted was to group the terms not distinguished by the clump classifi-
cation, and look at these. Accordingly, for each term, a list of the clumps to
which it belongs was prepared, and groups of terms were found which had all
their clumps in common. These groups were quite small (2-6 terms) and could
be studied easily. It turned out that some groups were ones of which a human
classifier could have thought (e.g., words concerning suffix removal for machine
translation came together) while others were quite justified by the documents con-
cerned, but would never have been thought of a priori. For example, the group:
"phrase marker, phoneme, Markov process, terminal language" was entirely
justified by the. . . contents of the library. It is groups of the latter kind that
represent a success for clump theory, for they function usefully in retrieval but
in no way form part of the structure of thought. . .which the human classifier's work
is likely to reflect."
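
The test Needham describes, that of listing for each term the clumps to which it belongs and then collecting the terms that have all their clumps in common, can be sketched as below. The function name and the input format (clump identifier mapped to its set of terms) are illustrative assumptions.

```python
from collections import defaultdict


def groups_with_common_clumps(clumps):
    """Group terms that belong to exactly the same set of clumps.

    `clumps` maps a clump identifier to the set of terms in that clump.
    Only groups of two or more terms are returned, since a single term is
    trivially "not distinguished" from itself.
    """
    membership = defaultdict(set)          # term -> clumps containing it
    for clump_id, terms in clumps.items():
        for term in terms:
            membership[term].add(clump_id)
    by_signature = defaultdict(set)        # clump list -> terms sharing it
    for term, clump_ids in membership.items():
        by_signature[frozenset(clump_ids)].add(term)
    return [terms for terms in by_signature.values() if len(terms) > 1]
```

The groups so obtained are the small sets of terms (2-6 in the reported experiment) that the clump classification fails to tell apart, and it is these that were inspected by hand.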



Still another application of the theory of clumps may be of use in the construction of 
thesauri (Sparck-Jones, 1962 [564]). Here the assumption is that rows of a correlation
matrix can be formed for words giving other words which are synonymous with respect to 
meaning. The overlaps of the same word's occurrence in two or more rows can then be 
used to find clumps which are presumed to represent conceptual groupings. 



Applications of clump theory to problems of mechanized documentation are also
being investigated by Dale and Dale of the Linguistics Research Center, the University of
Texas. They have begun experimentation to derive clumps for the 90 clue words used
by Borko and the 260 source-item computer abstracts used by both Maron and Borko.
Preliminary results reported so far are principally limited to considerations of the asso-
ciative networks between terms as derived from the structure of the clumps discovered
by several clump definitions. Mention should also be made of the work of Meetham and
Vaswani at the National Physical Laboratory, Teddington, England, looking toward the
use of similar techniques for machine-generated index vocabularies, with preliminary
emphasis on testing them against a "library" consisting of the propositions of Euclid's
geometry. 1/
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5. 3 Latent Class Analysis 



Like the earlier work of Tanimoto, the latent class analysis approach of Baker (1962
[27]) to problems of automatic information classification and retrieval is at least to date
theoretical rather than experimental in nature, and so will be considered only briefly here.
Baker claims that the latent class model developed in the field of the sociological sciences
for the determination of latent classes among individuals responding "yes" or "no" to
items in a questionnaire would have attractive features for application to information
categorization and search, because the model is based upon response patterns that are
analogous to the presence or absence of clue words or phrases in documents and because
the analysis yields an ordering ratio that could serve a function similar to the relevance
weightings suggested by Maron and Kuhns.

This ordering ratio is the probability that a given pattern of clue words will occur 
in a document properly belonging to a particular latent class. The probabilities of the 
same pattern being generated by a document properly belonging to other classes are also 
provided, giving an uncertainty which Baker thinks justifiable because a "document could 
generate a given pattern of key words, yet not belong to the same area of interest as the 
majority of documents possessing the same pattern of keywords". 1/ It should be noted,
however, that the question of how to select appropriate clue words is begged 2/ and that
no computer programs are as yet available for carrying out latent class analyses. 3/
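
Baker's own formulation is not reproduced in this report. Purely as an illustration, the sketch below computes class probabilities for a clue-word pattern under the usual latent class assumption that, within a class, the clue words occur independently of one another. The function name, the parameter names, and the reading of the ordering ratio as the probability of each class given the pattern are all assumptions.

```python
from math import prod


def ordering_ratios(pattern, class_sizes, word_probs):
    """Probability of each latent class given a presence/absence pattern.

    pattern     -- tuple of 1/0 flags, one per clue word
    class_sizes -- class -> relative size of the class (summing to 1)
    word_probs  -- class -> list of P(clue word present | class)
    Assumes local independence of clue words within a class, and that the
    pattern has nonzero probability under at least one class.
    """
    joint = {}
    for cls, size in class_sizes.items():
        likelihood = prod(p if present else 1.0 - p
                          for present, p in zip(pattern, word_probs[cls]))
        joint[cls] = size * likelihood
    total = sum(joint.values())
    return {cls: value / total for cls, value in joint.items()}
```

Because every class receives some share of the probability, the same pattern is never assigned to a single class with certainty, which is the uncertainty Baker regards as justifiable.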

5. 4 Examples of Other Proposed Classificatory Techniques 

There are certain other document classificatory techniques that have been proposed 
and to some extent investigated experimentally. Trials of document clusterings based 
on co-citingness, co-citedness, or bibliographic coupling as compared with subject con- 
tent groupings have, as noted above, been conducted both by Kessler at the M. I. T. 
Libraries and by Salton's group at Harvard. Consideration of Doyle's work on word
co-occurrence statistics has been deliberately deferred to a later section which covers 
his general "association map" approach. Similarly, several other investigations will be 
discussed in terms of potentially related research such as linguistic data processing. 

Two particular examples of other suggested classificatory techniques for document
grouping or classification are somewhat unusual, however. These are the methods pro-
posed by Te Nuyl and by Lefkovitz (1963 [353]). Cleverdon and Mills comment on Te
Nuyl's method as follows: 
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"Te Nuyl. . .uses, as quasi -descriptors, word-sets chosen from the Oxford English 
Dictionary (e. g. , any word falling between A- Ah) and relies on the subsequent 
correlation of terms to make sense of his seemingly bizarre choice. n 2/ 

Lefkovitz is concerned with the so-called "automatic stratification" of a file in 
which both generic or associative relationships and exclusive partitioning are used to 
facilitate search. He claims: 

"... The exclusive partitioning implies a separation of descriptors into groups 
such that no two descriptors in a group co-occur in any given document description 
of the file. This arrangement presents the dissociative properties of the file, or 
forbidden combinations. When coupled with a superimposed display of the 
'inclusive' or associative properties of the file a unique classification of the 
descriptors of this file results, which is based solely upon the association of the 
descriptors themselves within the document descriptions and not upon an arbitrary 
set of classes constructed by professional indexers." 

The purpose is to assist the searcher by warning him that if he chooses more than 
one descriptor from any one group as terms in his search request, there will be a null 
response from this particular file. However, the particular application considered 
involves a limited number of highly quantifiable or scalable "attribute-value" pairs (for 
so the descriptors involved are defined), such as "Age-23" and "Hair-red". It is by 
no means obvious that comparable exclusive partitionings could be achieved for literature 
items or that the recomputations necessary as new items enter the file can be achieved 
on a practical basis. 
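
The exclusive partitioning described above can be illustrated with a minimal sketch in
Python. The source does not give Lefkovitz's procedure, so the greedy grouping used here
is an assumption, as are the function names and the sample descriptions; the sketch only
enforces the stated property that no two descriptors in a group co-occur in any document
description.

```python
from itertools import combinations


def cooccurring_pairs(documents):
    """Return every descriptor pair that appears together in some description."""
    pairs = set()
    for description in documents:
        for a, b in combinations(sorted(set(description)), 2):
            pairs.add((a, b))
    return pairs


def exclusive_partition(documents):
    """Greedily put each descriptor in the first group where it co-occurs with no member.

    This greedy rule is an illustrative assumption, not Lefkovitz's published method.
    """
    pairs = cooccurring_pairs(documents)
    groups = []
    for d in sorted({d for doc in documents for d in doc}):
        for group in groups:
            if all((min(d, g), max(d, g)) not in pairs for g in group):
                group.append(d)
                break
        else:
            groups.append([d])
    return groups


# Hypothetical "attribute-value" descriptions in the style of the text's examples.
file_descriptions = [
    ["Age-23", "Hair-red"],
    ["Age-40", "Hair-red"],
    ["Age-23", "Hair-black"],
]
groups = exclusive_partition(file_descriptions)
print(groups)
```

With these sample descriptions the age values fall into one group and the hair values
into another, so a request naming two descriptors from the same group would draw a null
response from this file. Adding a new description can invalidate a group, which is the
recomputation problem noted above.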

6. OTHER POTENTIALLY RELATED RESEARCH 

In this section we shall consider certain areas of potentially related research that 
may prove applicable to the improvement of automatic indexing techniques. First is the 
area of thesaurus construction and use, which in turn is somewhat related to the develop- 
ment of statistical association techniques, especially for "indexing-at-time-of-search" 
and search renegotiations. Natural language text searching will also be briefly 
considered, together with related research in the general area of linguistic data 
processing. 

6.1 Thesaurus Construction, Use, and Up-Dating 

The first area of potentially related research which promises improvements in 
automatic indexing procedures is that of thesaurus lookups by machine. There are 
several different possible definitions of the word "thesaurus" in the context of informa- 
tion storage, selection and retrieval systems. The first is that it is a prescriptive 
indexing aid, or authority list, serving the function of normalizing the indexing language, 
primarily by the use of a single word form for words occurring in various inflections, by 
the reduction of synonyms, and by the introduction of appropriate syndetic devices. The 
second definition relates to the intended function for the provocation and suggestion to 
the indexer or the searcher of additional terms and clues, and it follows the idea of word 
groupings related to concepts as in a traditional thesaurus like Roget's. The third 
possible definition involves the special case of devices or techniques which display or use 
prior associations and co-occurrences of words, indexing terms, and related documents 
to provide a guide or suggestive indexing and search-prescription-formulation or 
renegotiation aid. 

The idea of a mechanized authority list, following the restrictive first definition, 
has been proposed by a number of investigators 1/ and has actually been used in computer 
programs as discussed for example by Schultz and Shepherd (1960 [532]), Shepherd (1963 
[545]) and Artandi (1963 [20]). It is the second definition of thesaurus with which we 
shall be principally concerned. It is, as we have said, close to the conventional idea of 
such a thesaurus as Roget's. It is based on the hypothesis that patterns of co-occurrences 
of words in a new item or in a search request can be compared with patterns of prior co- 
occurrences, as given by a thesaurus "head", in order to expand, clarify, or pin-point 
"meaning" and thus provide a more effective indication of the true subject content. The 
third definition will be considered as falling within the more general scope of statistical 
association techniques, although as Giuliano points out, "a retrieval system embodying 
an automatic thesaurus thus qualifies as being 'associative'." 2/ 

The application of a thesaurus-like approach to indexing and searching problems is 
again an area in which Luhn is one of the earliest proponents. In January 1953, he 
proposed a new method of recording and searching information in which a special diction- 
ary would be compiled for use in broadening the terms of a search request and in 
normalizing word usage as between various indexers (recorders) and searchers. Al- 
though he did not then use the term "Thesaurus" as such, he said in part: 

"The process of broadening the concept involves the compilation of a dictionary 
wherein key terms of desired broadness may be found to replace unduly specific 
terms, the latter being treated as synonyms of a higher order than ordinarily 
considered. Translating criteria into these key terms is a process of normalization 
which will eliminate many disagreements in the choice of specific terms amongst 
recorders, amongst inquirers, and amongst the two groups, by merging the terms 
at issue into a single key term. However, the dictionary does not classify or index 
but maintains the idea of being fields. . .A specific term may appear under the 
heading of several key terms and if according to its application an overlapping of 
concepts exists then the term is represented by the several key terms 
involved. . ." 1/ 

Some articles might deserve as an indexing term a word not contained 
in the article. By an authority list, the product of the mechanized indexing 

In subsequent papers, Luhn has developed related ideas of a "family of notions" and 
"dictionaries of notional families". 2/ In particular, he emphasizes that for automatic 
indexing, by contrast with automatic abstracting, consideration should be given to the 
normalization of variations in author-chosen terminology: "It will be necessary for a 
machine to resolve variation of word usage with the aid of a device the functions of which 
resemble a dictionary at one level and of a thesaurus at another level of requirements." 3/ 

The first issue of the National Science Foundation’s compendium of project state- 
ments, "Current Research and Development in Scientific Documentation", which appeared 
in July 1957 [430] reported several projects of interest in terms of thesaurus construc- 
tion and use, 4/ namely: (1) work by Luhn at IBM involving the establishment of a 
thesaurus to facilitate encoding of items whose texts would be available in machine-usable 
form, (2) work by Bernier and Heumann at Chemical Abstracts Service looking toward the 
development of a technical thesaurus (1957 [57]), and (3) an approach to mechanized 
translation proposing to use a mechanized thesaurus at the Cambridge Language Research 
Unit. This latter project incorporated the ideas of Masterman and her associates from 
about 1956 on (Halliday, 1956 [249]; Masterman, 1956 [403]; Joyce and Needham, 1958 
[305]), to apply the principle of checking co-occurrences of text words against thesaurus 
"heads" to which they belonged, in order to resolve homographic ambiguities and thus 
achieve more idiomatic translation by machine. 

For the ICSI Conference in 1958, Masterman, Needham and Sparck-Jones prepared 
a paper discussing analogies between machine translation and information retrieval, and 
recapitulated the arguments of Needham and Joyce for the use of a thesaurus in the 
formulation of search requests, as follows: 

"If a large number of terms are used to describe a document, the existence of 
synonyms is likely: in a system such as Uniterm no attempt is made to bracket 
the synonyms, which means that a request will produce only the document described 
in identical terms and not in synonymous ones. If the existence of synonyms 
is avoided, by using a small number of exclusive descriptors, the description 
of a document in terms useful for retrieval is more difficult, also it is equally 
difficult to relate a request to the description of documents. A further difficulty 
is that descriptions only list the main terms, and take no account of their relations 
to one another. The C. L. R. U. experiments being carried out make use of a 
thesaurus, a procedure through which it is hoped that these difficulties will be 
avoided and that a request for a document although not using the same terms as 
those in the document will produce that document and others dealing with the 
same problem, but described in different, though synonymous, terms." 1/ 

In general, the use of a thesaurus to constrain variations in word or term usage 
(as in our first definition, a mechanized authority list), to reduce synonymity, to resolve 
homographic ambiguity, to provoke and suggest additional terms or ideas to indexer and 
to searcher alike, is related to the improvement of automatic indexing procedures in 
precisely the same sense that its use would be effective in any indexing system whatso- 
ever. The construction and use of the thesaurus is also related, however, to linguistic 
data processing by machine in another way. Garvin suggests: 

". . . One may reasonably expect to arrive at a semantic classification of the content- 
bearing elements of a language which is inductively inferred from the study of 
text, rather than superimposed from some viewpoint external to the structure of the 
language. Such a classification can be expected to yield more reliable answers to 
the problems of synonymy and content representation than the existing thesauri 
and synonym lists, which are based mainly on intuitively perceived similarities 
without adequate empirical controls." 2/ 

This is with respect to the recognition that the machine itself can be used to compile 
and construct the thesaurus. While Luhn in some of his 1957-8 proposals still considered 
the compilation and organization of a thesaurus to be primarily a matter of human effort, 
he nevertheless pointed out that: "The statistical material that may be required in the 
manual compilation of dictionaries and thesauri may be derived from the original texts 
in any desired form and degree of detail." De Grolier makes the complementary 
statement that the Luhn techniques should "considerably facilitate" the preparation of 
thesauri. 4/ 

Even more importantly, the computer can be used for periodic up-datings and 
revisions. The work on the FASEB index-term normalization procedures involved early 
recognition of the need to "educate the thesaurus" by examining print-outs when no 
matches occurred and providing a continuous process of amendment. 5/ Computer- 
maintained statistics of word and term usages are closely related to possibilities for 
construction and revision of a mechanized thesaurus, as again Luhn has suggested. 1/ 
Schultz suggests that machine records should be maintained of what thesaurus terms are 
actually used for indexing and searching, the frequencies of term usage, the co- 
occurrences, the number of items described by particular combinations of terms and the 
like. 2/ 

The potential combinations of natural text processing, automatic indexing, and 
thesaurus construction and updating are stressed in many current programs. For 
example, Eldridge and Dennis discuss: 

"Indexing by machine from natural text in a fully automatic system, in which 
statistical analysis of the words is employed as a device for (a) building auto- 
matically a 'concept' thesaurus, (b) indexing incoming documents with reference 
to the thesaurus, and (c) continuously revising the thesaurus to reflect new word 
usages in currently incoming documents." 

Similarly, Giuliano and Jones suggest that given a term-term statistical association 
matrix, a transformation can be arrived at with a unit vector assigning value only to 
index term Z that ranks every other index term according to degree of association with Z; 
then by listing the higher ranked terms for each term Z, "a 'thesaurus' listing can be 
obtained completely automatically." 
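
The ranking idea attributed to Giuliano and Jones can be sketched as follows. This is a
minimal illustration, not their formulation: the association values, the term list, the
function name, and the cutoff `top_k` are all assumptions made for the example. Applying
the matrix to a unit vector for term Z simply picks out Z's associations, which are then
sorted.

```python
def thesaurus_listing(terms, assoc, top_k=2):
    """For each term Z, list the top_k other terms most strongly associated with Z."""
    n = len(terms)
    listing = {}
    for z, term in enumerate(terms):
        # Unit vector assigning value only to index term Z.
        unit = [1.0 if i == z else 0.0 for i in range(n)]
        # Matrix-vector product: score of every term with respect to Z.
        scores = [sum(row[j] * unit[j] for j in range(n)) for row in assoc]
        ranked = sorted(
            (i for i in range(n) if i != z),
            key=lambda i: scores[i],
            reverse=True,
        )
        listing[term] = [terms[i] for i in ranked[:top_k]]
    return listing


# Hypothetical symmetric term-term association matrix.
terms = ["film", "thin", "coating", "fungus"]
assoc = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.4, 0.0],
    [0.6, 0.4, 1.0, 0.5],
    [0.1, 0.0, 0.5, 1.0],
]
listing = thesaurus_listing(terms, assoc)
for term, related in listing.items():
    print(term, "->", related)
```

How the association values themselves are computed is left open here; any of the
measures reviewed in the literature could fill the matrix.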

6.2 Statistical Association Techniques 

A special definition of the word "thesaurus" might, as we have noted, include the 
development of devices and techniques which either automatically or by man-machine inter- 
action serve to suggest the amplification of a set of index terms. We shall briefly con- 
sider here both devices that visually display associations between words, terms, and 
documents 2/ and techniques for machine use of coefficients of correlation for prior co- 
occurrences in a collection of word-word, word-term, term-term, term-document, and 
document-document associations, the statistical association factor technique as first 
developed by Stiles. 

times each word is looked up in the index and the number of times each family 
part of the system for making periodic adjustments based on the usage of words 
or notions as mechanically established." 

It should be noted that Tabledex, the Scan-Column Index, and similar tools pro- 
vide to some extent a display of prior associations between index terms. (See 
pp. 25-27 of this report.) Thus Cheydleur (1963 [115], p. 58) remarks: "Ledley. . . 
has focussed on inter-item concepts in designing his economical TABLEDEX 
arrangement for displaying the connectivity of index terms and related file items." 

6.2.1 Devices to Display Associations: EDIAC 

The interest aroused among some documentalists by the provocative idea of a "Memex" 
to record and display associations between ideas as proposed by Bush in 1945 ([93]) led to 
specific attempts at Documentation, Inc. in the 1950's to develop a device which would 
incorporate at least the associations between indexing terms assigned to documents and 
between documents with respect to their sharing of common indexing terms (1954 [157], 
1956 [155, 156]). The first approach to this objective, as reported by Taube, was the idea 
of a manual dictionary of terms arranged in alphabetical order, with a "page" reserved for 
each and every indexing term used for any document in the collection. On each page would 
be listed all other terms that had co-occurred with that term in the indexing of one or more 
documents. Another idea was to display associations of terms used in a collection through 
the "superimposition of dedicated positions in a set of cards or plates. . ." 1/ 

Subsequently, an actual device to demonstrate a system for display of term-term, 
term-document, and document-document associations was built under an Office of Naval 
Research contract. 2/ The demonstration model contained a vocabulary of 250 terms which 
had been used in various combinations to index 100 reports. Interconnections in an elec- 
trical network provided the associational linkages. A display panel was provided with 
symbol-indicators which could be lighted up to identify particular terms and particular 
report numbers. 

This EDIAC device (for Electronic Display of Indexing Association and Content) was 
intended for use both in guiding an indexer to either the extension or refinement of his 
initial choice of indexing terms and in assisting the searcher. It was claimed that the 
operation of such a device would be extremely simple. Thus: 

"For the index question the searcher selects any term in which he is interested 
and applies a voltage. He is told instantly the number of the reports dealing with 
that subject. Putting voltage in at any term also lights all other terms associated 
with the first term. . ." 3/ 

A later analog device, ACORN, will be discussed below in connection with the work 
of Giuliano and associates, at Arthur D. Little, Inc. 

6.2.2 Statistical Association Factors - Stiles 

The name of H. Edmund Stiles, like those of Luhn, Baxendale, Maron, Swanson, 
Edmundson and Wyllys, is generally associated with pioneering innovations in those areas 
of mechanized documentation which are directly related to the use of high-speed computer 
capabilities. While Stiles' work has been directed primarily to problems of search 
prescription formulation and renegotiation based on the results of preliminary search, he 
has specifically recognized that the use of statistical word association techniques in 
searching operations can provide a logical corollary to automatic indexing procedures. 
Thus: 

"Automatic indexing, based on the relative frequency of words used in a document, 
produces a partial vocabulary of the content words used to express its subject. 
Retrieval can then be accomplished by expanding the request vocabulary. . . This 
method tends to overcome the deficiencies and inconsistencies inherent in the use 
of terms derived automatically from a text." 1/ 

Conversely, Stiles also points out the possibility that the results of automatic derivative 
indexing procedures, extracting indexing words from the documents directly, might prove 
a more realistic or reliable basis for the development of his word co-occurrence correla- 
tion data than do the Uniterms assigned by human indexers. 2/ The work of Stiles has also 
stressed the importance of two factors that may well be critical for the improvement of 
automatic indexing techniques. These are, namely, the consensus of prior human indexing 
and the consensus of subject coverage of a particular collection. 3/ 

In his experimental investigations, Stiles began with an existing collection of approx- 
imately 100,000 items which had previously been indexed, over a period of time, with a 
Uniterm indexing vocabulary consisting of about 15,000 terms. The objective of the 
experiments was to determine how, given a specific search request, a more effective "net 
to catch documents" 4/ could be generated and how the responding items might be ranked 
in order of their probable relevance to the request. 

The statistics of co-occurrence of terms used to index the same documents were first 
obtained. A modified chi-square formula was then applied to determine relative fre- 
quencies of use of co-occurring terms. 5/ Patterns of term co-occurrence could then be 
derived in the sense of term-profiles which show, for each term, the more significant of 
its associational values of pairing with other terms in the collection. The actual procedure 
for using these term-profiles in search prescription formulation and in document selection 
involves several steps, generally as follows: 6/ 

In general, we shall not be concerned with the precise mathematical formulations. 
It is to be noted that in a recent report Giuliano and his colleagues have reviewed 
a number of the various mathematical formulas proposed in the literature for the 
computation of word, term, and document associations, including those of Parker- 
Rhodes and Needham, Maron and Kuhns, Stiles, Salton, Osgood, Bennett and 
Spiegel (Giuliano et al, 1963 [230], Appendix I). 

1. For each term in the initial formulation of a search request, the 
appropriate term-profile is obtained, which gives weighted values 
for those other terms that had significantly co-occurred with it. 

2. The profiles of each term in a multi-term request are compared 
and those additional terms common to all or a specified number of 
the profiles are selected and added to the initial set. 1/ 

3. The "first generation" terms resulting from step 2 are next treated 
as though they also were request terms, and steps 1 and 2 are repeated 
for them. 

4. A selection is made from some reasonable proportion of the profiles 
associated with the first generation terms to produce the "second 
generation" terms. 2/ 

5. The expanded list of search terms is then compared with the index 
terms assigned to each document in the collection, and whenever a 
match is found the weight of the request term is assigned to the 
matching document term. These weights are then summed to provide 
a numeric measure of probable document relevance to the original 
request. 

6. Documents responding to the expanded request are printed out in the 
order of document relevance scores. 
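
The six steps above can be sketched in Python under several stated assumptions. The
term-profiles are taken as given rather than computed from a chi-square formula; request
terms are assigned a weight of 1.0; an added term receives the mean of its profile
values; and the function names, the `min_profiles` threshold, and the sample data are
hypothetical. None of these details is specified in the account of Stiles' procedure.

```python
def expand(seed_terms, profiles, min_profiles):
    """Return terms found in at least `min_profiles` seed profiles, with mean weights."""
    totals, counts = {}, {}
    for seed in seed_terms:
        for term, weight in profiles.get(seed, {}).items():
            totals[term] = totals.get(term, 0.0) + weight
            counts[term] = counts.get(term, 0) + 1
    return {
        term: totals[term] / counts[term]
        for term in totals
        if counts[term] >= min_profiles and term not in seed_terms
    }


def rank_documents(request, profiles, documents, min_profiles=2):
    """Expand a request through two generations and score documents by summed weights."""
    weights = {term: 1.0 for term in request}             # assumed request weight
    first = expand(request, profiles, min_profiles)       # steps 1 and 2
    second = expand(list(first), profiles, min_profiles)  # steps 3 and 4
    for generation in (first, second):
        for term, weight in generation.items():
            weights.setdefault(term, weight)              # keep the earliest weight
    scores = {                                            # step 5
        doc: sum(weights.get(term, 0.0) for term in set(index_terms))
        for doc, index_terms in documents.items()
    }
    ranked = sorted((d for d in scores if scores[d] > 0),
                    key=lambda d: (-scores[d], d))        # step 6
    return ranked, scores


# Hypothetical term-profiles and indexed documents.
profiles = {
    "thin": {"film": 1.0, "coating": 0.5, "deposition": 0.5},
    "film": {"thin": 1.0, "coating": 0.75, "deposition": 0.25, "photography": 0.25},
    "coating": {"film": 0.75, "thin": 0.5, "vacuum": 0.5},
    "deposition": {"thin": 0.5, "film": 0.25, "vacuum": 0.25},
}
documents = {
    "D1": ["thin", "film", "vacuum"],
    "D2": ["film", "photography"],
    "D3": ["coating", "deposition", "vacuum"],
    "D4": ["fungus", "tests"],
}
ranked, scores = rank_documents(["thin", "film"], profiles, documents)
print(ranked, scores)
```

In this toy collection document D3 carries neither request term, yet it is retrieved
through the added terms and outranks D2, which was indexed by one request term only.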

Some experiments have been made using a computer program which accepts up 
to 300 weighted terms in an expanded request vocabulary. Representative results have 
been reported, in part, as follows: 

"... We asked a qualified engineer to examine these documents and specify which 
were related to 'Thin Films' and which were not. . . This engineer was not 
familiar with our project. . . yet. . . we found a remarkably high correlation between 
his evaluation and the document relevance numbers. . . We then checked to see how 
the documents containing information on 'Thin Film' had been indexed. We found 
that the first five documents on our list had been indexed by both 'Thin' and 'Film'. 
Three more documents had been indexed by 'Film' alone, and other related terms. 
Two documents had not been indexed by either 'Thin' or 'Film', but only by a group 
of related terms, yet they contained information on 'Thin Films' and had a high 
document relevance number. By using association factors and a series of statisti- 
cal steps, easily programmed for a computer, we were thus able to locate 
documents relevant to a request even though the documents had not been indexed 
by the terms used in the request." 1/ 

These are called "first generation terms" and tend to reflect only statistical asso- 
ciations without including synonyms and near-synonyms which, over the course of 
time, have occurred in the indexing vocabulary. 

2/ Stiles, 1961 [571], p. 274: "Among these we find words closely related in meaning 
to the request terms." An example given in Ref. [572], pp. 200-201, is the 
derivation of 'weathering', 'fungicidal', 'deterioration', and 'preservatives' as 
second generation terms when the initial request included the terms 'plastics', 
'fungus', 'coating', and 'tests'. 



In another case, which was analyzed in detail, a request profile of 26 terms that 
had been intuitively weighted by the customer resulted in the machine listing of 246 
presumably responsive documents. Of these, 81 documents were of primary interest 
to the customer, and an additional 78 were of secondary interest to him. 



The statistical association technique as proposed by Stiles has also been investi- 
gated at the Datatrol Corporation, with particular reference to the field of legal 
literature (Hammond et al, 1962 [251]). About 350 documents in the field of Federal 
public law were indexed in cooperation with George Washington University, using a 
vocabulary of 680 index terms. A computer program was written for the IBM 7090 that 
can accommodate a 1200 x 1200 matrix to calculate the Stiles association factors. Trials 
were made of various thresholds to determine which other terms were sufficiently high in 
association strength to a particular term to be selected for that term's profile. 



Given the generation of the term profiles, a less sophisticated computer such as 
the 1401 can be used for the expansion of request terms and the actual conduct of searches. 
Such a program was demonstrated at the Annual Meeting of the American Bar Association, 
August 1962, with running of "live" requests suggested by jurists and with what are 
claimed to be "highly gratifying results". A point of interest relates to the question of 
updating of term-profiles and other statistical association factor data. Hammond et al 
report: 

"The term profiles were generated a total of three times in the course of the 
pilot study, making it possible, to some extent, to assess the effect of 
vocabulary growth. Judging from this limited experience, it appears that a bi- 
monthly, or perhaps even quarterly, recompilation of term profiles should be 
sufficient for a mature collection." 3/ 



Stiles, 1961 [577], pp. 198-199. 
Stiles, 1962 [573], p. 9. 

3/ Hammond et al, 1962 [251], p. 6. 

6.2.3 The Association Map - Doyle and Related Work at SDC 

The name of Doyle is again that of an early and prolific investigator and innovator 
in the field of mechanized documentation and linguistic data processing. One of his 
provocative suggestions is generally known, in his own terminology, as that of "semantic 
road maps for literature searchers" or an "association map" technique. As a matter of 
convenience, we have chosen to consider this suggestion and a variety of related work 
under the general heading of the association map technique, 1/ although passing reference 
has been made to some of Doyle's suggestions and findings elsewhere in this report. 

Beginning in 1958 (Doyle, 1959 [168]) information retrieval projects at the System 
Development Corporation have had, among other objectives, that of developing ways to 
use computers in the processing and interpretation of natural language text. By February 
of 1959, a computer program was already in operation that could search fragments of 
about 100 words of keypunched text, match input words against a pre-established clue word 
selection list (i.e., an inclusion dictionary) and substitute a short encoded form to be 
used for subsequent search. Processing of keypunched abstracts using this program in- 
volved computer time at the rate of four abstracts per second. 

Other features of this text compiler, and of subsequent text processing programs 
developed at SDC, enable the making of frequency counts and other statistical measures. 
Such features are then used for the investigation of, for example, word-word, word- 
document, and word-subject associations, looking toward the determination of answers to 
such questions as: "Do subject words have distribution characteristics within a library 
that a computer program can detect?" 

Compare Doyle himself, 1962 [163], p. 383: "Swanson and others have offered 
thesauri of synonyms and related terms. . . (to assist in indexing or search 
processes). . . An association map is, in a sense, an extension of this solution; it is 
a gigantic, automatically derived thesaurus. Confronted by such a map, the 
searcher has a much better 'association network' than the one existing in his mind, 
because it corresponds to words actually found in the library, and, therefore, words 
which are best suited to retrieve information from that library." See also Wyllys, 
1962 [651], p. 16: "L. B. Doyle (1961) has invented a fascinating search tool which 
seems to us to belong at a level intermediate between automatic indexes and auto- 
matic abstracts; i.e., a possible search method might be to have the computer scan 
automatic indexes and compare the index terms therein with the request, then 
obtain the possibly pertinent documents and display their association map for the 
user to examine. . ." 

Doyle's investigations of word co-occurrences have included hypotheses and tests 
of various probabilistic measures in terms of observed frequencies, in terms of "boing!" 
words (so-called because of the mental sound effect they elicit), 3/ in terms of adjacent 
word pairs and affinities between particular nouns and particular adjectives, 4/ and in 
terms of distinctions between frequency (the total number of times a word appears in a 
given library corpus) and prevalence (the total number of items in which a particular word 
appears). 2/ He has also stressed distinctions between adjacent words and high corre- 
lations for words that are not closely positioned together in text, as follows: 

"We have also perceived that two different cognitive processes seem to be 
responsible for each type of correlation, one (adjacent correlation) involving 
the habitual use of word groups as semantic units, and the other (proximal 
correlation) having to do with the pattern of reference to various aspects of 
that which is being discussed. We can call the statistical effects, respectively, 
'language redundancy' and 'reality redundancy'. Such a resolution of statistical 
effects is full of significance for information retrieval because it appears likely 
that reality redundancy can vary greatly from one science to another, whereas 
language redundancy, a universal property of talking and writing, is relatively 
invariant." 1/ 
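
Doyle's distinctions between frequency and prevalence, and between adjacent and
proximal co-occurrence, can be made concrete with a small counting sketch. The sample
items and function names are hypothetical, and the "proximal" count below is only a crude
stand-in: it counts any two distinct words appearing in the same item, regardless of
position, which is an assumption rather than Doyle's own measure.

```python
from collections import Counter


def frequency_and_prevalence(items):
    """Frequency: total occurrences of a word. Prevalence: number of items containing it."""
    frequency, prevalence = Counter(), Counter()
    for words in items:
        frequency.update(words)
        prevalence.update(set(words))
    return frequency, prevalence


def adjacent_and_proximal(items):
    """Count ordered adjacent word pairs and unordered same-item word pairs."""
    adjacent, proximal = Counter(), Counter()
    for words in items:
        adjacent.update(zip(words, words[1:]))
        distinct = sorted(set(words))
        proximal.update(
            (a, b) for i, a in enumerate(distinct) for b in distinct[i + 1:]
        )
    return adjacent, proximal


# Hypothetical items, each already reduced to a list of topical words.
items = [
    "radar output radar display".split(),
    "manual output".split(),
    "radar display console".split(),
]
frequency, prevalence = frequency_and_prevalence(items)
adjacent, proximal = adjacent_and_proximal(items)
print(frequency["radar"], prevalence["radar"])
print(adjacent[("radar", "display")], proximal[("console", "radar")])
```

Here "radar" has a frequency of 3 but a prevalence of 2, and the pair "console" and
"radar" co-occurs within an item without ever being adjacent.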

With respect to the "semantic roadmap" or "association map" technique itself, 
Doyle's suggestion is that various measures of word and index term cross-associations 
may be applied to the generation of graphic displays of both types of co-occurrence 
relationships. Because of the variety of, in particular, the "proximal" correlations, it 
is assumed that the literature searcher should be given a display in which the repre- 
sentation of the assemblage of the varied relationships is two-dimensional rather than 
one. 2/ An example is given, based upon computer processing of 600 abstracts of SDC 
internal reports to find intersections between 500 topical words, of associational con- 
nections for the word "output". This was generated by selecting the eight words most 
strongly correlated in the data with "output", such as "manual" and "radar", and then 
finding three other words highly correlated with each of these and also correlated with 
"output" itself. From the initial graph, it is further shown that item surrogates might 
be generated by word selection rules applied to documents to pick up, for example, 

"New York Air Defense system data -> outputs -> D. C." 

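The construction of the "output" example can be sketched as a two-level selection over a
table of word correlations. This is a minimal illustration under assumptions: the
correlation values are invented, "correlated with the center word" is taken to mean any
positive value, and the function name and parameters are hypothetical. The defaults of
eight and three follow the description above; the usage example shrinks them to keep the
data small.

```python
def association_map(center, corr, n_primary=8, n_secondary=3):
    """Pick the words most correlated with `center`, then for each of these the
    words most correlated with it that are also correlated with `center`."""

    def top(word, k, exclude, require_center=False):
        candidates = [
            (w, c)
            for w, c in corr.get(word, {}).items()
            if w not in exclude
            and (not require_center or corr.get(center, {}).get(w, 0) > 0)
        ]
        candidates.sort(key=lambda wc: (-wc[1], wc[0]))
        return [w for w, _ in candidates[:k]]

    primary = top(center, n_primary, {center})
    return {p: top(p, n_secondary, {center, p}, require_center=True) for p in primary}


# Hypothetical correlations between topical words.
corr = {
    "output": {"manual": 0.8, "radar": 0.7, "display": 0.3, "data": 0.2},
    "manual": {"output": 0.8, "display": 0.6, "tape": 0.9},
    "radar": {"output": 0.7, "data": 0.5, "display": 0.4},
}
result = association_map("output", corr, n_primary=2, n_secondary=1)
print(result)
```

In this sample "tape" is strongly correlated with "manual" but is left off the map
because it has no correlation with "output" itself.
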
Continuing related work by Doyle and others at SDC has included various experi- 
mental studies of "pseudo-documents" consisting of lists of the twelve most frequently 
occurring words in 100-item samples of abstracts in various subject fields (Doyle, 1961 
[161]). Of special interest in terms of potential improvements and modifications to 
machine indexing techniques are studies, based on similar lists, looking to the separa- 
tion of words that may have been used in several different senses, i.e., the detection of 
homographs by statistical means (Doyle, 1963 [171]). More recent investigations by 
Doyle involve considerations of differences between word-grouping and document-group- 
ing techniques and of possibilities for use of hybrid methods. 

6.2.4 Work of Giuliano and Associates, the ACORN Devices 

A program directed toward the design of "an English command and control language 
system" under an Air Force contract with Arthur D. Little, Inc., involves several inter- 
related aspects of natural language text processing, use of statistical association factors 
in search, man-machine interaction during search, and display of associational relation- 
ships by means of analog network devices. In this program and in related research, 
Giuliano and his associates are convinced that: 

"Automatic index term association techniques are needed to improve the recall 
of relevant information; to enable indexers and requestors to use language in a 
more natural manner, and enable retrieval of relevant messages which are 
described by different index terms than those used in the inquiry." 1/ 

For the most part, the work to date has been directed to "associative retrieval" of 
messages limited to single sentences of English text, and to the search phases of a pro- 
posed system. 

In the case of a corpus consisting of 230 sentences from a single text, a partially
automatic indexing method was used. The text was first processed against a modified
version of the Harvard Multipath Syntactic Analysis computer program and the resulting
analyses were manually screened to select a unique, correct analysis for each sentence.
Next, approximately 500 words, those that had been marked "noun" by the syntactic
analyzer, were listed out and these in turn were manually screened to provide an
"inclusion list" of 273 words. Sentences were then "indexed" with respect to which of
these selected words they contained. Word associations were computed both in terms of
co-occurrence within a sentence and of co-occurrence in syntactic structures.
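
The sentence-level step described above can be illustrated by a minimal sketch. It is
not the Arthur D. Little program: the function names, the sample sentences, and the
four-word inclusion list are all hypothetical, and only co-occurrence within a sentence
is counted (co-occurrence in syntactic structures would need a parser).

```python
from collections import Counter
from itertools import combinations

def index_sentences(sentences, inclusion_list):
    """Return, for each sentence, the set of inclusion-list words it contains."""
    vocab = {w.lower() for w in inclusion_list}
    return [vocab & set(s.lower().split()) for s in sentences]

def cooccurrence_counts(indexed):
    """Count how many sentences contain each unordered pair of index words."""
    counts = Counter()
    for words in indexed:
        for a, b in combinations(sorted(words), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical three-sentence corpus and inclusion list.
sentences = [
    "the radar output is displayed",
    "manual output of radar data",
    "the display shows data",
]
indexed = index_sentences(sentences, ["radar", "output", "data", "display"])
pairs = cooccurrence_counts(indexed)
```

Here "output" and "radar" share two sentences, so their pair count is the largest; a
real system would go on to normalize such counts for word frequency.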



Retrieval tests were then applied using both computer programs and the analog
device, and evaluations were made on the basis of examining sentences selected in order
of machine-ranked relevance and of comparisons of word lists associated with a given
search term against association lists for another term picked at random. It is noted that,
"although quantitative conclusions cannot be drawn", the results support the conclusion
that: "Items retrieved due to automatically-generated associations tend to be more
relevant than is explainable on a chance basis." 2/



The "request reformulation" retrieval program has also been used to generate term
profiles from a collection of approximately 10,000 documents (previously indexed with at
least 6 terms from a selective term vocabulary of 1,000 terms) which have then been
compared against lists provided in the entries for corresponding terms in the Thesaurus of
ASTIA Descriptors, Second Edition. The machine-produced association lists, at least
for those words occurring relatively frequently in the corpus, appear to give thesaurus
entries that are extensive, specific, intuitively acceptable, and of high quality,
especially with respect to listings of synonyms as well as factually related words. 3/



The development of the ACORN (Associative Content Retrieval Network) devices
has provided additional tools for testing and display (1962 [229], 1963 [227, 304]).
These devices are networks of passive resistance elements. Each word or index term
and each sentence (240 by 230 in ACORN-IV) are represented by terminals interconnected
by resistors with conductance equal to the connection strength, and with "leak" resistors
providing for various normalizations that may be applied to compensate for word or
sentence frequency factors. These devices differ from the earlier EDIAC in the
variable weightings provided, in the normalizations that may be applied, and in multipath
interconnections.

When, for example, currents are applied at some of the word terminals, the voltages
appearing on any of the other word terminals depend on the strengths of association
between these words and the input words via all direct and indirect paths. The responses
of sentence terminals to the input words of a query similarly depend upon how strongly a
sentence is connected to these words and how strongly it is connected to other words
which in turn are strongly connected to the query words. It is to be noted further that:

"Pulling out or cutting a few randomly selected wires in an ACORN generally 
has a surprisingly small effect. . . This insensitivity is of course, explainable 
in terms of the multiplicity of indirect and redundant association paths which 
remain intact when a direct path is severed. . . It. . . suggests that the retrieval 
process can indeed be made insensitive to minor variations in indexing." 1/ 

In addition, there are intriguing possibilities for imposing a "viewpoint" with
respect to a search by injecting bias currents. Thus if only non-"Air Force" jet
plane items were desired, the "Air Force" items could in effect be grounded out. If there
were no jet items in the collection other than those which were also Air Force items,
these would be indicated as responsive, but largely they would appear only if this should
be the case. Some words used have some connection to almost all other words, but these
have little effect in the system and the hardware thus tends to compensate for the high
frequencies of very general words.
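
The behavior of such a resistive network can be imitated numerically. The following
is a minimal sketch under stated assumptions, not a model of the actual ACORN
hardware: every terminal is given the same "leak" conductance to ground, the node
voltages are found by Gauss-Seidel iteration on Kirchhoff's current law, and the
function name, node names, and conductance values are all hypothetical.

```python
def node_voltages(nodes, conductance, leak, injected, sweeps=200):
    """Approximate the steady-state voltage at every terminal.

    conductance: dict mapping frozenset({a, b}) -> conductance of the resistor
                 joining terminals a and b (the "connection strength")
    leak:        conductance from each terminal to ground (assumed uniform)
    injected:    dict mapping terminal -> current injected at that terminal
    """
    neighbours = {n: {} for n in nodes}
    for pair, g in conductance.items():
        a, b = tuple(pair)
        neighbours[a][b] = g
        neighbours[b][a] = g
    v = {n: 0.0 for n in nodes}
    for _ in range(sweeps):
        for n in nodes:
            # Current balance: (leak + sum g) * v[n] = injected + sum g * v[m]
            total_g = leak + sum(neighbours[n].values())
            inflow = injected.get(n, 0.0) + sum(
                g * v[m] for m, g in neighbours[n].items()
            )
            v[n] = inflow / total_g
    return v

# Hypothetical four-terminal network; "soil" has no association with the others.
nodes = ["jet", "aircraft", "radar", "soil"]
conductance = {
    frozenset({"jet", "aircraft"}): 2.0,
    frozenset({"aircraft", "radar"}): 1.0,
}
v = node_voltages(nodes, conductance, leak=0.5, injected={"jet": 1.0})
```

With current applied at "jet", the response falls off along the chain of associations
("aircraft", then "radar"), and the unconnected terminal "soil" stays at zero, which
is the indirect-path effect described in the text.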

6.2.5 Spiegel and Others at Mitre Corporation

Bennett and Spiegel, reporting at the Symposium on Optimum Routing in Large
Networks, IFIP Congress-1962, 2/ consider modifications to formulas for the calculation
of statistical association factors which will normalize against such influences as frequency 
of word occurrences, relative word position within a string of words, and string length. 
This work has been carried forward at the Mitre Corporation in a program for developing 
procedures to encode various statistical properties of messages or documents and to use 
these codes for message routing and retrieval. 

Differences between this approach and those of Maron and Kuhns, Stiles, and Doyle
relate primarily to the question of how best to normalize. The objective is closely
similar: to use associational weighting so as to provide, in response to a query, output of
documents or messages ranked in order of probable relevance to the query.

Additional features include provision for the matrix of coefficients of association
to change with time or with deliberate manipulation to improve performance. Thus:

"Each normalized cell weight. . . rises and falls with time as each specific 
association increases or decreases in relative frequency. In this way, the 
matrix memory of associations changes with time, maintaining a cumulative 
pattern of associations reflecting one statistical characteristic of messages 
fed into it in the past. . . 



"In addition to this adaptive characteristic of changing memory with time and with
changing inputs, the matrix is also readily subject to formal education. Any
specific cell weight can be strengthened by repeatedly reading into the matrix
memory the specific strings that contain the desired associations. For example,
by introducing the strings is am, is are, am is, am are, and are am, we can
increase the statistical tendency of the tokens is, am, and are to be associated." 1/
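
The adaptive behavior quoted above can be sketched in a few lines. The source does
not give the Mitre normalization formulas, so this illustration substitutes a simple
exponential decay: every stored weight is multiplied by a `decay` factor each time a
new string is read. The class name, the decay rule, and the sample strings are
assumptions made only for the example.

```python
from collections import defaultdict
from itertools import combinations

class AssociationMatrix:
    """Pairwise token weights that fade with time unless reinforced."""

    def __init__(self, decay=0.9):
        self.decay = decay                 # assumed forgetting factor per string read
        self.weights = defaultdict(float)  # (token_a, token_b) -> weight

    def read(self, tokens):
        # Older associations fall in relative strength...
        for key in self.weights:
            self.weights[key] *= self.decay
        # ...while pairs present in the new string rise.
        for a, b in combinations(sorted(set(tokens)), 2):
            self.weights[(a, b)] += 1.0

    def weight(self, a, b):
        return self.weights.get(tuple(sorted((a, b))), 0.0)

m = AssociationMatrix(decay=0.5)
m.read(["is", "am"])
m.read(["radar", "output"])
for _ in range(3):          # "formal education": repeat the desired string
    m.read(["is", "am"])
```

After the three repeated readings the "is"/"am" cell has been strengthened while the
unreinforced "radar"/"output" cell has decayed.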



Experimental results have been obtained for a corpus of 500 bibliographic entries
contained in DDC's Title Announcement Bulletin. In the case of a three-term query, 40
items were selected and ranked in probable relevance order, with selection based on a
particular relevance score value threshold. The investigators then reviewed the abstracts
of all 500 items and rated them as to relevance with respect to the query. Seven
additional items were found, of which three would have been machine-selected with a
less stringent selection threshold. For the remaining four, it is reported that they "were
poorly indexed and could have been judged not relevant by a human who depended upon the
descriptor string only, as the matrix did, rather than upon review of the abstracts." 2/



6.3 Clues to Index-Term Selection from Automatic Syntactic Analysis



Several of the organizations and research teams most active in the investigation
of linguistic data processing techniques, especially for automatic indexing, extracting
and search renegotiation applications, are actively considering the use of clues derived
from automatic syntactic analysis to improve criteria for machine selection of
"significant" words, phrases, and sentences from raw text. Such approaches, in general,
however, are subject to the limitations of non-availability of sufficient corpora of text
in machine-usable form, in the first place, and, even more importantly, by the
non-availability of satisfactory computer programs for complete syntactic analysis up to the
present time. 1/ In terms of the state-of-the-art of automatic indexing, therefore, we
shall not consider these approaches as more than indications for future research. A few
suggestive examples are discussed briefly below.

The multi-pronged attack on mechanized information selection and retrieval 
problems headed by Salton and his associates includes the exploration of tree structures, 
to represent both the relationships between terms in a classification schedule or indexing 
term vocabulary and the representation of the results of automatic syntactic analyses of 
natural language text. It is proposed, then, that computer programs can achieve trans- 
formations of the syntactic trees representing word strings in the original text into 
simplified, condensed structures with normalized terms and can compare these trees 
with the classificatory trees (Salton, 1961 [516]). Manipulation of such trees together 
with appropriate dictionaries or thesauri can result, for a given proposed index term, in
the finding of a preferred term for a particular system, or a set of synonymous terms, or 
sets of all terms in which the given term is included, and the like. 

Anger considers some of the problems involved in complete syntactic analysis of 
texts with the objective of identifying the total network of relationships expressed and 
implied, as proposed by Lecerf, Ruvinschii, and Leroy, among others, of the Research 
Group on Automated Scientific Information (GRISA), EURATOM. Assuming that computer 
programs for syntactic analysis are or will be available, he suggests that simplifications 
may be obtained by determining only the basic relations that are indicated by direct 
syntactic dependencies or by linking words (Anger, 1961 [15]).

A specific program for automatically extracting syntactic information from text has 
been studied by Lemmon (1962 [354]). The possibilities for combining dictionary lookups,
word suffixes as indicators of syntactic role, and predictive syntactic analysis for text
processing have also been further explored by Salton himself (1962 [518], 1963 [519]).

A variety of word and document association techniques and of synonymous word and 
phrase groupings which serve to "clue" the selection of a subject heading are also being 
investigated by members of the Harvard group and guest investigators. 



1/ Major difficulties have to do with limitations both upon grammars and vocabularies
so far tested and with ambiguities and the number of alternative parsings generated.
See, for example, Bobrow, 1963 [68], Kuno and Oettinger, 1963 [341], and
Robinson, 1964 [502]. Bobrow provides a survey of syntactic analysis programs
as of 1963, noting limitations or restrictions on each. He reports, for example, 
that available programs to compute word classes are not always correct in the 
class assignments made and that analysis systems are not complete unless they 
provide means for distinguishing between "meaningless strings and grammatical 
sentences whose meaning can be understood". He concludes: "Until a method of 
syntactic analysis provides, for example a means of mechanizing translation of 
natural language, processing of a natural language input to answer questions, or a 
means of generating some truly coherent discourse, the relative merit of each 
grammar will remain moot." ([68], p. 385) Robinson ([502], p. 12) says of
sentences which can be parsed correctly, that they are: "Usually short sentences 
with no complicated embeddings of relative clauses and few participial or 
prepositional phrase modifiers. These include the basic sentences that most 
grammars are equipped to handle and that adult writers seldom produce." 



Another partial approach to applying syntactic analysis techniques to automatic 
indexing is based upon syntactic word-class recognitions. Giuliano and his associates
at Arthur D. Little, Inc. (1963 [230]), have investigated on a small-scale basis the
use of the Kuno-Oettinger programs developed at Harvard for this purpose (Kuno and 
Oettinger, 1963 [340]). The broad program of information and language data processing 
research at System Development Corporation specifically includes investigations of 
structural patterns of sentences at the syntactic level and also of semantic factors 
such as the studies of polysemy and homographic ambiguity by Doyle, Wasser, and 
others. Borko reports:

". . . We . . . are analyzing actual written text for multiple meanings. . . The data
for this study were drawn from the corpus of 618 psychological abstracts.
Tabulations of frequency of paired and single word listings were used. A
number of corpus-derived word frames have been prepared. Although this
research is still in its early phase, we feel that we have made a good start
on the problems of semantic analysis." 1/

In Czechoslovakia, at the Karlova Universita, both statistical and semantical methods for
automatic abstracting are reported as being under consideration. 2/

Other examples of proposals for the use of syntactic analysis techniques for the
improvement of automatic indexing products include those of Spangler, Levery, Plath,
Thorne, and Climenson and his colleagues at RCA, as well as the suggestions of those
whose interests in automatic syntactic analysis have been primarily directed to problems
of machine translation or more general problems of linguistic analysis. Hays, for
example, although principally concerned with MT, indicates that the methods for
determining phrase structures have obvious applications to the automatic determination
of categories useful in the indexing of documents. 3/

An existing GE-225 computer program for KWIC-type indexing from both titles and
abstracts at General Electric's Phoenix Laboratories is being extended to incorporate
word analysis features taking into account both syntactic and semantic aspects of a given
line or sentence of text. 4/ Levery provides an example of similar directions being
explored in European research, more generally oriented toward linguistic considerations
as such than to machine-derivable criteria (largely statistical to date), which seek to
combine the benefits of both human and machine processes by way of automatic syntactic
analyses. He claims, for example, that:

". . . The study of the position of keywords in the text and the syntactical
relationship which exists among them will show the way to automatic
abstracting and the use of more sophisticated retrieval systems." 1/

Plath suggests that, given a computer program to perform the parsing and 
syntactic diagramming of a text sentence, the results can serve quite usefully to augment
the selection criteria based initially on statistical techniques, such as word-frequency
counting. He says, for example: 

"Another possible application of the outputs of the sentence diagramming program 
is their employment as an aid in language data processing for purposes of 
information retrieval, particularly in systems for automatic literature abstracting 
of the sort proposed by Luhn (1958). The feature of the tree diagrams which is 
pertinent here is that the main components of a clause, including subject, verb 
and object, always correspond to the 'main topics' in an outline, and are therefore 
located at the upper levels of the tree. When the words on these upper levels are 
considered apart from the lower-level structures which modify them, they often
summarize the content of the sentence in a sort of 'newspaper headline' or
'telegraphic style'." 2/

The problems of multi-level selection, or screening, such that machine programs
for selection of the most probably significant words, phrases, or sentences can be
focussed upon the most probably content-revelatory areas of text, are treated here, as
also by Salton, in the sense of a cutting-off at a given depth in the analyzed syntactic
structure. 1/ A potentially important contribution to the future prospects for automatic
indexing, however, lies in the "discourse analysis" and "transformational linguistics"
approach of Harris (1959 [254]), where condensations and concentrations of similarities
and differences of topical interest may hopefully be achieved.

Harris himself suggested, at least as early as 1958, applications of his approach to 
both automatic indexing and abstracting. A goal of the analyses he has proposed is to 
identify 'kernels' of linguistic expression, having first, by various transformations such 
as from passive to active voice, brought together different ways of saying the same thing. 
He then suggests not only machine operations to normalize by application of his
transformational rules but also to determine:

". . . Which kernels have the same centers in different relations (e.g., with
different adjuncts), and other characterizing conditions. The results of this
comparison would indicate whether a kernel is to be rejected or transformed
into a section. . . of an adjoining kernel, or stored, and whether it is to be
indexed, and perhaps whether it is to be included in the abstract."

Certain difficulties are self-evident. Consider, for example, the admittedly hypothetical
text which might refer in various places to the "dissolute, disreputable, illiterate, elder
Lincoln" (underlining supplied) and which might be so processed by machine as to imply
that Lincoln the son was, although also President of the United States, "dissolute,"
"disreputable," "illiterate," and "elder." These, however, are difficulties that plague
almost any machine processing of natural language text.

Climenson, Hardwick, and Jacobson have explored some of the possibilities of the
Harris approach in experimental computer programs for the RCA 501 (1961 [133]).
Specific features of these programs include:

1. Establishment of the syntactic class or classes to which a given word can 
belong, by dictionary lookup. 

2. Investigations of sentence structure and context in an attempt to resolve the 
homographic ambiguities involved when the same word may function either 
as a noun or a verb. 

3. Isolation and marking of sentence segments, such as noun phrases, pre- 
positional phrases, adverbial phrases, and verb phrases. 

4. Identification and marking of segments -- clauses or degenerate clauses. 

On a very preliminary basis, a limited set of word and phrase deletion rules were 
set up and several sample documents were processed against them, yielding reductions 
to about 35 percent of the original text. These results suggest that "syntactical filtering 
criteria" might be applied to the improvement of modified derivative indexing techniques, 
such as the word-frequency counting techniques, either by deleting syntactically insignifi- 
cant parts of selected sentences, or by counting identical phrases rather than words. The 
investigators conclude, however, that:

"A formal linguistic approach to the problems of natural language processing 
promises to yield results vital to the success of automatic indexing and data 
extraction. But the work required in such an approach will be quite arduous; 
a long-range man-machine effort will be required to formulate practical 
machine programs for indexing and abstracting." 1/

A final special case of linguistic data processing involving syntactic analysis is 
that of Langevin and Owens. They claim:

"A critical review of the analysis work done on the Nuclear Test Ban Treaty 
by use of the Multiple Path Syntactic Analyzer demonstrates that such a device 
can, even at present, provide a powerful technique for the systematic discovery 
of ambiguities in treaties and other documents. Because the analyzer operates 
without bias from the overall context of the document, it may sometimes be 
possible for it to discover ambiguities that would easily escape a human reviewer 
who knows what the document is 'supposed to say'." 2/

6.4 Probabilistic Indexing and Natural Language Text Searching

As in the case of automatic indexing proposals based upon automatic sentence
extraction techniques, machine searching of full natural language text has been suggested
as a basis for, at least, automatic derivative indexing. We have remarked previously
that the machine use of complete text can only be considered to be "indexing" in a very
special sense, that it is subject either to the non-availability of suitable corpora already
in machine-usable form or to high costs of conversion to this form, and that too little
is yet known of linguistic analysis and searching-selection strategies effectively applicable
to natural language materials. Various examples of corroborating opinion, other than
those previously cited, are as follows:

"Machine searching is superb if it is known exactly how to describe the object of 
search, and if one could know how to choose from among many possible search- 
ing strategies. I doubt if any one is yet in this comfortable position with respect 
to machine searching of text." 1/

"The most effective programs in automatic linguistic analysis have served only
to illustrate how really complex is the structure of the language, and how far
removed the present state of the art is from any system which might be useful
in practice." 2/

"The recognition of words involves only the matching of digital codes, but
the recognition of an idea is a severe intellectual problem, the solution to
which will probably never be exact. Nevertheless, this is the problem which
must be attacked if accuracy is ever to be attained, or even approached, in
using the text of information items as a basis for their recovery."

Nevertheless, some of the work both in natural language text searching and in
"probabilistic indexing" (where weights representing judgments as to degree of relevance
of an indexing term to an item are used either in indexing or search) provides instructive
insights into some of the problems of automatic indexing.

In the period 1958-1960, work at Ramo-Wooldridge resulted in the release or
publication of provocative papers by Maron, Kuhns, and Ray on "probabilistic indexing"
(1959 [398], 1960 [397]) and by Swanson on natural language text searching by computer
(1960 [587, 582], 1963 [583]). Subsequent work along these lines has included further
developments at Thompson Ramo-Wooldridge, the law statutes work at the Health Law
Center at the University of Pittsburgh, and the experimental investigations of Eldridge
and Dennis in a project jointly sponsored by the American Bar Foundation, IBM, and the
Council on Library Resources.

6.4.1 Probabilistic Indexing - Maron, Kuhns, and Ray

The work in the area of "probabilistic indexing" involves, as in the case of Stiles' 
statistical association factors, an assumption that there should be machine means avail- 
able for the automatic elaboration of search requests in order that relevant documents not 
indexed by the precise terms of these requests may be retrieved. Given that measures of 
"closenesses" and "distances" between similar documents can be obtained, probabilistic 
weighting factors between index terms assigned to documents may be made explicit. 

More generally, however, the notion of probabilistic indexing is based upon the assign- 
ment of weights that provide a numerical evaluation of the probable relevance of index 
terms to a particular document, and of the relative importance of the various terms 
used in a search request. Maron and Kuhns (1963 [39?]) thus consider the following
variables important in the formulation and following out of search strategies: 

1. Input - both the terms of the request and the weights assigned to them.

2. A probabilistic matrix giving dissimilarity measures between documents, 
significance measures for index terms, and closeness measures between 
index terms. 

3. A priori probability distribution data. 

4. Output - a class of retrieved documents ranked in order of their "computed
relevance numbers" and an indication of the number of documents involved 
in the class. 

5. Search parameter controls, such as the number of documents desired. 

6. Search prescription renegotiation involving amplification of the request by 
adding terms "close" to the ones in the original request and the selection 
of additional documents following distance criteria for the collection. 1/ 

Experiments have been reported for 40 requests run against 110 articles taken from
Science News Letter. Without search renegotiation, the "answer" document was
retrieved in only 27 of the 40 tests. Three alternative methods of request elaboration 
were then tried. First, additional terms most strongly implied, statistically, by the 
terms in the request were used. Secondly, those terms were added which most strongly 
imply, again in a statistical sense, each of the given request terms. Thirdly,
coefficients of association between index terms were used. Results are reported as follows:

"(1) Using the method of request elaboration via forward conditional
probabilities between index tags, we retrieved the correct answer
document in 32 cases out of the 40.

(2) Elaborating the requests via the inverse conditional probability heuristic,
we retrieved the correct document in 33 of the 40 cases.

(3) Using the coefficient of association to obtain the elaborated request we
obtained success in 33 cases of the 40.

"Thus we see that the automatic elaboration of a request does, in fact, catch
relevant documents that were not retrieved by the original request." 1/
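
The first two elaboration methods can be illustrated with a minimal sketch. It
assumes that the conditional probabilities are estimated from simple counts of
documents carrying each index tag; the function names and the five-document sample
are hypothetical, and the third method is omitted because the text does not state
which coefficient of association was used.

```python
def count(docs, *tags):
    """Number of documents indexed with all of the given tags."""
    return sum(1 for d in docs if all(t in d for t in tags))

def forward(docs, request_term, candidate):
    """Estimate P(candidate | request_term): the request term implies the candidate."""
    n = count(docs, request_term)
    return count(docs, request_term, candidate) / n if n else 0.0

def inverse(docs, request_term, candidate):
    """Estimate P(request_term | candidate): the candidate implies the request term."""
    return forward(docs, candidate, request_term)

def elaborate(docs, request, measure, k=1):
    """Add to the request the k tags scoring highest against any request term."""
    vocabulary = set().union(*docs) - set(request)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-max(measure(docs, r, t) for r in request), t),
    )
    return list(request) + ranked[:k]

# Hypothetical collection: each document is the set of index tags assigned to it.
docs = [
    {"reactor", "neutron", "fission"},
    {"reactor", "neutron"},
    {"reactor", "shielding"},
    {"fission", "neutron"},
    {"fission"},
]
by_forward = elaborate(docs, ["reactor"], forward)
by_inverse = elaborate(docs, ["reactor"], inverse)
```

In this sample the two heuristics add different terms: "neutron" is the tag most
often found with "reactor", whereas "shielding" is the tag whose presence most
strongly implies "reactor".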

6.4.2 Natural Language Text Searching - Swanson 

The work in automatic indexing and related research directed by Swanson at
Ramo-Wooldridge Corporation has included "indexing at the time of search" in natural language
text searching (1960 [582, 587], 1963 [583]), the previously mentioned studies of
machine-like indexing by people (Montgomery and Swanson, 1962 [421]), and automatic
assignment indexing using pre-selected lists of clue words (Swanson, 1963 [580]). The
last of these three major areas of investigation is the one of the greatest interest in this 
present study, but the earlier experiments in machine searching of natural language texts 
warrant some discussion. In his reports on this text searching project, Swanson has 
specifically claimed that the methods for transforming search questions can serve as 
the basis for an automatic indexing method. Thus: 

". . . A technique for automatic indexing can be derived immediately from a text
searching technique. . . it is necessary only to so organize the machine procedures
that those operations of text reduction or reorganization common to all searches
are performed only once and prior to searching in order to create directly an
automatic indexing procedure." 2/

Swanson has also claimed that if automatic searching of full text is not feasible, 
then automatic indexing is not feasible, the one being prerequisite to the other. For 
example: 

"Clearly, if a computer technique for search and retrieval from the full text
of a collection of documents cannot be developed, then it is unthinkable that
matters could be improved by using the machine to operate on just part of
the information (a 'condensed representation') -- that is, on an automatically
produced index. This line of argument demonstrates persuasively that the
development of techniques for automatic full-text search and retrieval is a
prerequisite to automatic indexing. It is equally clear that a technique for
automatic indexing can be derived immediately from a text-searching
technique, and thus that the two processes involve conceptually equivalent
problems." 3/

In the actual text searching experiments, a model "library" consisting of 100 short
articles in the field of nuclear physics was set up in machine-usable form. These articles
were also studied by subject specialists who rated the relevance of each paper to each of
50 questions, and assigned weighting factors representing the degree of judged relevance.
A second group of people, who knew only that the papers were in the field of nuclear
physics, then transformed the 50 questions into search prescriptions using three
different methods. The first method for the development of the search instructions was
to choose appropriate index entries from a subject heading list tailored to the contents of
the sample library. Search was then made manually against a card catalog which
recorded the results of manual indexing of the same 100 articles to the entries of this
list.



The second method of search prescription tested involved the specification of
combinations of words and phrases likely to be found in any paper which would in fact be
relevant to the search question. The third method involved modification of the second by
the use of a thesaurus-type glossary which suggested various alternative terms. Both
the latter two types of search instructions were fed to a computer program which carried
out searches against the natural language text consisting of 250,000 words from the
original articles.

The results were then evaluated in terms of ratings of relevance made by the
physicists who had analyzed the papers. Retrieval effectiveness was not high: ". . . in no
case did the average amount of relevant material . . . retrieved (taken over 50 questions)
exceed 42 per cent of that which was judged . . . to be present in the library." 1/ However,
the results were indicative of the superiority of the machine methods to the manual
catalog search. 2/ For this library in particular, in the case of "source documents" (the
articles from which the search questions were taken), only 38 percent of the relevant
papers were located by the manual search, whereas 68 percent of the relevant items
were retrieved by machine search of the text for specified words and phrases in various
"and" and "or" combinations. Machine search based on search instructions that had been
developed with the assistance of the thesaurus-glossary yielded 86 percent of the relevant
source item documents.
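
The second and third search methods can be sketched as follows. This is only an
illustration of "and"/"or" matching against running text with optional glossary
expansion, not Swanson's program; the function names, the two sample article texts,
and the glossary entries are hypothetical.

```python
def matches(text, prescription):
    """A prescription is a list of groups. Every group must be satisfied ("and");
    a group is satisfied if any one of its words or phrases occurs ("or")."""
    padded = " " + " ".join(text.lower().split()) + " "
    return all(
        any(" " + term.lower() + " " in padded for term in group)
        for group in prescription
    )

def expand(prescription, glossary):
    """Add to each group the alternative terms the glossary suggests."""
    return [
        sorted(set(group) | {alt for term in group for alt in glossary.get(term, [])})
        for group in prescription
    ]

def search(articles, prescription):
    return sorted(name for name, text in articles.items() if matches(text, prescription))

# Hypothetical two-article library and thesaurus-type glossary.
articles = {
    "a1": "Measurement of neutron scattering cross sections",
    "a2": "Elastic collisions of slow neutrons with protons",
}
prescription = [["neutron"], ["scattering"]]
glossary = {"neutron": ["neutrons"], "scattering": ["collisions"]}

plain_hits = search(articles, prescription)
expanded_hits = search(articles, expand(prescription, glossary))
```

The plain prescription finds only the first article; the glossary-expanded one also
finds the second, which uses different wording for the same topic. This is the kind
of gain the 68 and 86 percent figures above reflect.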

6.4.3 Full Text Searching - Legal Literature

"The retriever of documents may be satisfied with a sample of descriptors that 
represent the contents; the fact retriever or the question answerer must often have 
access to every word in the text". 1/ The objective of fact retrieval is a major goal in 
the experimentation that is being carried forward in the field of natural language text 
searching of legal material, especially the texts of statutes of the State and Federal 
Governments. The most extensive program to date is that of Horty and his colleagues 
at the University of Pittsburgh Health Law Center (1960 [277], 1961 [276, 309], 1962
[196, 278], 1963 [24, 280]).

Wilson at the Southwestern Legal Foundation is experimenting with a modified
version of the Horty-Pittsburgh System for legal cases dealing with arbitration in five of
the southwestern states. 1/ A joint American Bar Foundation - IBM research program has
been established to explore both text searching without prior indexing and automatic
indexing techniques (Eldridge and Dennis, 1962 [183], 1963 [182]).

In the Horty-Pittsburgh System, approximately 6,000,000 words of text have been
converted via Flexowriter to magnetic tape. An exclusion dictionary of ^.00 words is used 
to eliminate the most common words and a word-concordance is prepared, resulting in 
word-occurrence location indicia by position in sentence, paragraph and section of the
statute. In searching, the user has available to him the alphabetized list of approximately
17,000 different words and it is up to him to think of the words and synonyms most likely
to occur in the statute sections he seeks. Several search logics are
available. One provides that at least one of a group of alternate words must appear;
another requires that at least one from two or more groups must appear in the same
sentence. Intra-sentence distance criteria are also utilized: "If the phrase 'born out of
wedlock' is sought, the operator . . . requires that the word 'wedlock' appear in the same
sentence, no more than three words after 'born'." 2/
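
The three search logics just described can be illustrated with a minimal sketch.
It is not the Pittsburgh program: the function names are hypothetical, the second
logic is read here as "one word from each group in the same sentence" (an
assumption), and a sentence is simply a list of lower-cased words.

```python
import re

def tokenize(sentence):
    """Lower-case a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def any_of(tokens, group):
    """Logic 1: at least one word of a group of alternates appears."""
    return any(word in tokens for word in group)

def each_group(tokens, groups):
    """Logic 2 (assumed reading): every group supplies a word to the sentence."""
    return all(any_of(tokens, group) for group in groups)

def within(tokens, first, second, max_gap):
    """Logic 3: `second` occurs no more than `max_gap` words after `first`."""
    starts = [i for i, tok in enumerate(tokens) if tok == first]
    return any(second in tokens[i + 1 : i + 1 + max_gap] for i in starts)

tokens = tokenize("A child born out of wedlock may inherit from the mother.")
print(within(tokens, "born", "wedlock", 3))
```

In this sentence "wedlock" is the third word after "born", so the distance
criterion quoted above is just satisfied; tightening `max_gap` to 2 rejects it.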

Obviously, for the same question the searcher would also have to specify synonymous
words and phrases -- "illegitimate children", "illegitimate births", "unwed mothers",
"unmarried mothers", "illegitimacy", "bastardy", and so on. The reported success of 
the system is apparently due in large part to the ingenuity of the searchers in specifying 
the expressions and synonyms most likely to be used. Hughes comments as follows: 

"It should be noted that this system will be most efficient only when the users 
are thoroughly familiar with the linguistic style of the source material and 
search is made on words known to occur in the appropriate statutes". 

6.5 Other Examples of Related Research in Linguistic Data Processing

Since, as Garvin has emphasized, "All areas of linguistic information processing
are concerned with the treatment of the content, rather than merely the form, of docu-
ments composed in a natural language," much of the research in linguistic data
processing is potentially applicable to both the development and the improvement of
automatic indexing techniques. Thus developments in automatic content analysis, in
psycholinguistics, in question-answering systems, may eventually find application to
mechanized indexing systems.
In terms of our present concern, however, we shall select only a few examples.

"By automatic content analysis is meant the use of computer programs to detect or select
content themes in a sentence-by-sentence scanning of text or verbal protocols". 1/ The
interest of psychologists in machine techniques to assist in the analysis of linguistically-
given materials, as in propaganda analysis, probably precedes, at least in sophistication
if not by date, that of documentalists or of machine specialists interested in library and
information problems. 2/

The "General Inquirer" program developed by Stone et al. 3/ is an example of
question-answering techniques based upon selective extractions from natural language
text. It involves the use of a master vocabulary consisting of words previously selected
by an investigator as being likely to be content-indicative in a body of material to be
processed, together with his pre-established indications of the categories he expects
their occurrence should predict. It is to be noted that this is a custom-tailored set of
categories and of clue-word lists associated with each, manually pre-established. Text
is now processed in such a way that each word is looked up and, if it appears in the master
vocabulary, it is tagged with identifiers of the categories for which it is presumably
predictive. A subsequent "Tag Tally" routine then counts the tag frequencies to deter-
mine for which categories the input material has high or low scores, and these in turn
can be compared with expected norms.
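
The look-up and tally steps can be pictured with the following minimal sketch.
It is an illustration only, not the General Inquirer itself: the vocabulary,
the category names and the norm values are all invented for the example.

```python
from collections import Counter

# Hypothetical master vocabulary: clue word -> categories it is taken to predict.
MASTER_VOCABULARY = {
    "mother": ["FAMILY"],
    "father": ["FAMILY"],
    "afraid": ["DISTRESS"],
    "alone": ["DISTRESS"],
    "work": ["ACHIEVEMENT"],
}

def tag_tally(text, vocabulary):
    """Look up each word and count the category tags attached to the matches."""
    tally = Counter()
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'")
        for category in vocabulary.get(word, []):
            tally[category] += 1
    return tally

def deviation_from_norms(tally, total_words, norms):
    """Observed tags per word minus the expected rate, for each category."""
    return {c: tally.get(c, 0) / total_words - rate for c, rate in norms.items()}

text = "Mother was afraid; father was alone."
tally = tag_tally(text, MASTER_VOCABULARY)
print(tally)
print(deviation_from_norms(tally, len(text.split()), {"FAMILY": 0.1, "DISTRESS": 0.1}))
```

A positive deviation marks a category on which the input scores high relative
to the assumed norm, a negative one a category on which it scores low.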

This type of program has been applied to such varied materials as suicide notes,
folk tales from different cultures, reports of field workers, recordings of group dis-
cussions as in supervisory-leadership training sessions, and protocols for various
psychological tests. Interesting variations developed by Jaffe and others 4/ involve
the use of non-verbal as well as verbal clues as content-indicators, specifically, time-
sequence patterns recorded along with the words spoken in client-therapist sessions. At
the meeting of the Association for Computational Linguistics and Machine Translation
held in Denver, August, 1963, Jaffe reported findings indicative of positive correlation
between the structure of temporal and lexical patterns in dialogue and suggested applica-
tions to automatic abstracting or indexing by the use of the time-sequence patterns as
clues to high information-value areas.

Hughes provides, as of September, 1962 ([284]), a critical review of several
experimental and proposed question-answering systems using natural language statements
and natural language queries, including "BASEBALL", 2/ "SAD SAM" 3/ and the "Proto
Synthex" investigations of System Development Corporation. 4/ Later developments on
the Synthex (synthesis of complex verbal material) project at SDC have included a
variation on a natural language text searching program where ordinary text input is run
against an exclusion list and a table is set up to tally the substantive words remaining.
Words with the same roots or previously having been identified as synonymous are cross-
referenced. A complete index results, with document location identifier tags for the
word occurrences down to the single sentence level. This index can be used subsequently
to locate regions of text (volume, chapter, paragraph, and sentence) where answers
responsive to input questions are likely to be found.
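
The indexing step described above might be sketched as follows. This is an
illustrative reconstruction, not the SDC program: the exclusion list, the root
table and the nested-list representation of a volume (chapters of paragraphs of
sentences) are assumptions made for the example.

```python
from collections import defaultdict

EXCLUSION_LIST = {"the", "a", "of", "in", "and", "is", "are", "to", "from"}
ROOTS = {"whales": "whale", "whaling": "whale"}  # hypothetical cross-references

def words_of(text):
    """Lower-cased words of a sentence or question, punctuation removed."""
    return text.lower().replace(".", "").replace("?", "").split()

def build_index(volume):
    """Map each substantive word (by root) to (chapter, paragraph, sentence) tags."""
    index = defaultdict(list)
    for c, chapter in enumerate(volume):
        for p, paragraph in enumerate(chapter):
            for s, sentence in enumerate(paragraph):
                for word in words_of(sentence):
                    if word not in EXCLUSION_LIST:
                        index[ROOTS.get(word, word)].append((c, p, s))
    return index

def candidate_regions(index, question):
    """Rank sentence locations by the number of question words they contain."""
    hits = defaultdict(int)
    for word in set(words_of(question)):
        if word not in EXCLUSION_LIST:
            for location in set(index.get(ROOTS.get(word, word), [])):
                hits[location] += 1
    return sorted(hits, key=lambda loc: (-hits[loc], loc))

volume = [[["Whaling ships sailed from England.", "The crew hunted whales."]]]
index = build_index(volume)
print(candidate_regions(index, "Where are whales hunted?"))
```

The ranked locations are only regions in which an answer is likely to be found;
nothing in this sketch attempts the syntactic or semantic analysis of the query.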

It is proposed that the Synthex system eventually should incorporate analyses of
syntactic and semantic relationships in the linguistic expressions of both queries and
text. Of future interest in the extension of such considerations to automatic indexing and
abstracting are the following comments:

"The results of several early experiments within the project, coupled with the 
findings of other language researchers, led to the following conclusions about 
meaning and grammatical structure in English text: 

1. The degree of synonymity in meaning between any two English 
words can be measured quantitatively with a synonym dictionary 
and relatively simple scoring procedures. 

2. The difference in meaning between two sentences of identical 
syntactic structure can be expressed quantitatively as a function 
of synonymity of their words. . . " 1/ 

It is also of interest to note that although the "indexer" program of the Synthex
system provides cross-referencing between, for example, "whales" and "whaling" or
"England" and "Great Britain", the investigators admit that "naturally it falls short of
such complicated cross-referencing as 'mouse-animal', 'Jones-person' and other
concept recognitions." However, concept recognitions based upon both a priori and
a posteriori associations are at least foreshadowed in a small-scale model of
attribute-words and proper names, together with prespecified relationships between them; 1/ in
Olney's recent work at SDC exploring the possibilities for use of cognitive concepts as
bases for establishing association between documents; 2/ and by Kochen's work on machine
inference and concept processing. 3/

A final example of potentially related research in the area of content analysis is
therefore the work of Kochen, Abraham, Wong and others at IBM's Thomas J. Watson
Laboratories (1962 [329]). While concerned principally with adaptive organization and
processing of stored factual statements and the possibilities for machine formulation
of "hypotheses" about these and additional facts, some consideration has been given to
sampling procedures applicable to determination of similarity which might be used for
document clustering and to the possibilities for dynamic clustering for retrieval based
upon a specific individual query. 4/ In the proposed AMNIP (Adaptive Man-machine
Non-Arithmetical Information Processing) system, there is no attempt at either automatic
indexing or automatic abstracting. 5/ Instead, formal statements are made about named
"things" and their attributes. The sharing of common attributes then serves as a basis
for relating items which are similar and for grouping them together in the system
memory. It is assumed that the organization of the stored statements changes dynami-
cally with new data inputs and user feedback in question-answering routines.

Where the named items are names of documents or of index terms, a number of
documentation applications can be considered. Where the items are document names
and the formal predicate is "cites", the system provides a procedure for production and
use of citation indexes. 6/ Where the items are index terms or subject headings and
the predicates are "is used synonymously with" or "is subsumed under", machine
construction of a growing thesaurus based on use is suggested. 7/ The common attribute
matching program, applied to logical similarities of texts related as by having various
assigned descriptors or citations in common, might provide a basis for generating
document surrogates by representing each text in a related group of texts with the words
or sentences these texts have in common. 1/

In the case of man-machine interaction during search, it is suggested that the user
should indicate the names of selected documentary items which are of particular interest,
then:

"The machine forms an 'hypothesis' about the subset of articles likely to be of
interest. It does this by examining all recorded statements common to the ones
selected but not to the rejected ones. The weight of different attributes and
degree of interest is taken into account. The machine may display this hypo-
thesis or another random sample of titles consistent with it, or both."
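
A minimal sketch of the hypothesis-forming step is given below. It is not the
AMNIP program: statements are represented, by assumption, as (predicate, value)
pairs attached to each named item, the document names are invented, and the
weighting of attributes and degrees of interest mentioned in the quotation is
left out.

```python
def form_hypothesis(statements, selected, rejected=()):
    """Statements common to all selected items and to none of the rejected ones."""
    if not selected:
        return set()
    common = set.intersection(*(statements[name] for name in selected))
    for name in rejected:
        common -= statements[name]
    return common

def consistent_items(statements, hypothesis, exclude=()):
    """Names of other items whose statements include the whole hypothesis."""
    if not hypothesis:
        return []
    return sorted(
        name for name, attrs in statements.items()
        if hypothesis <= attrs and name not in exclude
    )

# Hypothetical store of formal statements about named documents.
STATEMENTS = {
    "doc1": {("cites", "docA"), ("indexed-by", "thesaurus")},
    "doc2": {("cites", "docA"), ("indexed-by", "clustering")},
    "doc3": {("cites", "docB"), ("indexed-by", "thesaurus")},
    "doc4": {("cites", "docA"), ("indexed-by", "abstracting")},
}

hypothesis = form_hypothesis(STATEMENTS, ["doc1", "doc2"], ["doc3"])
print(hypothesis)
print(consistent_items(STATEMENTS, hypothesis, exclude={"doc1", "doc2", "doc3"}))
```

Here the only statement shared by the two selected items is that they cite
"docA", so the remaining item that also cites it is offered as consistent with
the hypothesis.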

6.6 Machine Assistance in Translations of Subject Content Indications to Special Search
and Retrieval Language

There are also, in the areas of directly and indirectly related research, certain
programs of research, development, and experimentation which include investigations
of possibilities for using machines to assist in the "translation" of textual languages into
special intermediate or "documentary" languages. Doyle's use of the inclusion list
principle to extract specified content-indicative words and to encode them in his "bigram"
index was an early but relatively trivial example. 2/ The work of Williams and her
associates, at Itek and elsewhere, 3/ has involved the objectives of determining which of
the subject-revealing implications of titles, abstracts and, if necessary, full text, are
susceptible to machine detection and manipulation such that the implied as well as the
explicit assertions made in a document may be incorporated in a formalized language for
retrieval.

While Williams, Barnes, Gardin and Levy, and others, have so far approached
such tasks primarily from the standpoint of human analytic judgments, Coyaud (1963
[143]) has discussed at least preliminary work looking toward the automation of the
analysis of natural language texts for purposes of encoding and organization of the terms
and relationships to be used in the "documentation language" known as "SYNTOL"
(Syntagmatic Organization of Language). This work has used a corpus based on
bibliographic abstracts from the Bulletin signalétique of the Centre National de la Recherche
Scientifique, Psychophysiology Section, for the period 1958-1960. Notwithstanding such
difficulties as determining rules for proper subdivisions of text, reduction of synonyms,
resolution of lexical and syntactic ambiguities, and the fact that some words are always,
but some never, used in SYNTOL itself, he reports that both substantives and textual
expressions indicative of certain specific SYNTOL relations can be unambiguously
identified. Contextual clues are used: for example, if the word "homme" occurs it is
translated as "sexe masculin" if "femme" also occurs, as "etre humain" if "animal" is also
mentioned, and as "sujet experimental" otherwise.
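
The contextual rule for "homme" can be written out as a small sketch. The
function name is hypothetical, and the order in which the two clues are tested
when both "femme" and "animal" occur is an assumption; the text above does not
say which takes precedence.

```python
def translate_homme(words):
    """Pick a rendering of 'homme' from the other words of the same text unit."""
    present = {word.lower() for word in words}
    if "homme" not in present:
        return None
    if "femme" in present:       # contrast with "femme": the sex is meant
        return "sexe masculin"
    if "animal" in present:      # contrast with "animal": the species is meant
        return "etre humain"
    return "sujet experimental"  # default reading in this corpus

print(translate_homme(["reaction", "chez", "homme", "et", "animal"]))
```

The default branch reflects the psychophysiology corpus, where an unqualified
"homme" is most often the subject of an experiment.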

Melton and her associates at the Center for Documentation and Communication
Research, Western Reserve, have also been investigating machine processing of input
text with a view to the automatic selection and manipulation of clue words and relation-
ships between them for information retrieval purposes. Their material consists of
abstracts from the metallurgical section of Chemical Abstracts. From sample abstracts,
a lexicon is developed which involves classification of words into those that are signifi-
cant from a metallurgical point of view; those that name materials, compounds, environ-
ments; those denoting processes; those denoting characteristics of materials; preposi-
tions; those which will not operate in the analysis of the text, and the like.

On the basis of analysis of a number of sentences from the sample text, rules for 
combination and selection of specified words in specified relationships can be set up. 
These rules are designed to identify sentence types which: 

(1) Describe performance of a process on a material. 

(2) Discuss a material in terms of properties, components, form, or 
environment. 

(3) Describe a process without reference to specific materials. 

(4) Discuss metallurgical properties without reference to specific
materials.

(5) Discuss two or more materials, properties or processes. 

(6) Describe a causal relationship between two properties. 

(7) Give a comparison of materials. 

(8) Contain no words of interest in the system. 
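
How a lexicon of word classes could drive the recognition of such sentence types
may be sketched as below. The sketch is an assumption-laden simplification, not
the Western Reserve rules: the lexicon entries are invented, only types (1),
(2), (3), (4) and (8) are distinguished, and the relational types (5) to (7)
are not attempted because they need more than word-class membership.

```python
import re

# Hypothetical lexicon: word -> class, as might be derived from sample abstracts.
LEXICON = {
    "annealing": "process", "rolling": "process",
    "steel": "material", "copper": "material",
    "hardness": "property", "ductility": "property",
}

def sentence_type(sentence):
    """Return a sentence-type number (1, 2, 3, 4 or 8) from word classes alone."""
    words = re.findall(r"[a-z]+", sentence.lower())
    kinds = {LEXICON[word] for word in words if word in LEXICON}
    if not kinds:
        return 8                      # no words of interest in the system
    if "material" in kinds:
        return 1 if "process" in kinds else 2
    if "process" in kinds:
        return 3                      # process, no specific material
    return 4                          # properties only

for text in ("Annealing of steel.", "Copper has high ductility.", "Results are discussed."):
    print(sentence_type(text), text)
```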

Computer programs to explore the possibilities for automatic analyses of the 
kind developed manually for the sample abstracts will be written with the objective of 
finding an effective compromise between mere word identification and total linguistic 
analysis. Melton says: 

"If one considers this method of analysis from the point of view of the linguist,
he can immediately describe many grammatical constructions, which will
prevent the meaningful reduction of these sentences. It is not known at this
time how often such sentences will appear in the corpus of this investigation.
Nor is it known how adversely such failure would affect the retrieval of the
information in these sentences. The answers to these questions will be
available only after a large sample has been analyzed and put to an extensive
retrieval test. At its most successful the project will achieve an automatic
processing of metallurgical text which will permit retrieval of the type of
information which can be stated in its own terms with a tolerable amount of
inappropriate selections. Should this goal be unattainable, the project will
have generated a file of abstracts automatically searchable on the word level
or somewhat beyond. For the benefit of other research it will also have
produced tapes of the true text of a large sample of natural-language ab-
stracts and a lexicon containing all the words of a corpus of current
scientific literature."

6.7 Example of a Proposed Indexing System Utilizing Related Research Techniques

In addition to the automatic assignment indexing and automatic classification
techniques for which experimental results have been reported, several other techniques
and programs have been proposed. One is the joint American Bar Association-IBM
research program (Eldridge and Dennis, 1963 [182]), for which discussion has been
deferred because of its proposed use of several of the research techniques covered
previously in this section. The experimental corpus will consist of the full text of
approximately 5,000 legal case reports taken chronologically from the Northeastern
Reporter. Approximately half of this material will be processed to obtain word frequency
counts. The frequencies will then be used to prepare for each different word an estimate
of the skewness of its distribution in the collection. The investigators will then personally
inspect the word list as ordered by skewness to divide it into "non-informing" (Type I
words, or an exclusion list) and "informing" (Type II words, or an inclusion list) at some
appropriate cutting point. Then, for each document, a list will be prepared of its
"informing" (Type II) words, maintaining order within the document. For each pair of
such words, statistical association factors will be computed. Eldridge and Dennis
describe other aspects of their proposed technique, in part, as follows:

"For each document in the body of 2,500 cases, a list will be prepared of its
Type II words, maintaining their original order within the document ... For each
Type II word an 'association factor' will be calculated for every other Type II word
with which it appears in any one document by compiling the probability that Word A
would appear this close to Word B this number of tries over the entire file, if
the Type II words were distributed at random. (This amounts to borrowing Stiles'
idea of the association factor, but implementing it with a numerical method which
takes into account nearness of the words within the document as well as the fact
that they both occur in the same document.) Since the factors are probabilities,
they will be numbers between zero and one . . . These numbers will be used to
estimate the distances between words in index-word space.

"The next step is to construct from the information about distances between pairs of 
words an index-word space in which every word is at the correct (or approximately 
correct) distance from every other word in the system with which it exhibits 
association. The result of this operation can be visualized schematically as a sort 
of grid in which every word can be placed in its appropriate position by assigning 
it a set of coordinates. " 

"Indexing of the remaining cases in the experiment will be performed by machine
from full text, using the Type I list of discard words and the Type II list to pre-
pare an analysis of the frequencies related to index-word space. Instead of
selecting specific words as indexing terms, concepts will be selected (statistically)
as volumes in index-word space. A rough physical analogy to this process would
be to toss pennies at the previously mentioned grid so that, for every Type II
word in the source document, a penny lands at its proper slot on the grid. Where
the pennies heap up in a pile, you have a concept."

"Searching will be carried out essentially by indexing a question presented
narratively, determining the concept volumes that represent the question, and
searching those volumes in document space for the relevant document numbers.
Since the 'edges' of the concept volumes are determined statistically, output can
be listed in order of probable relevance; as an option the question could be
accompanied by a request that 'at least 100 references be supplied', in which case
the concept boundaries would be adjusted to provide that number." 1/

It will thus be noted that the proposed indexing and search program begins on a 
derivative basis to establish for one-half the experimental material the significant words,
next combines word frequency with significant word distance data to derive probabilistic 
association factors between words, then develops clusters, and finally indexes the items 
in terms of the clusters rather than words so as to provide assignment rather than 
extraction of index terms. 
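
The penny-tossing analogy quoted above can be made concrete with a small
sketch. It illustrates the final indexing step only and rests on stated
assumptions: the word coordinates are invented rather than derived from
association factors, several words are allowed to share one grid cell (a
coarser grid than the "slots" of the quotation), and a fixed minimum pile
height stands in for the statistical determination of concept boundaries.

```python
from collections import Counter

# Hypothetical grid positions of Type II words in index-word space.
GRID = {
    "contract": (0, 0), "breach": (0, 0), "damages": (0, 1),
    "negligence": (3, 2), "injury": (3, 2),
}

def concept_cells(text, grid, min_pennies=2):
    """Toss one penny per Type II word occurrence; report cells where they pile up."""
    pennies = Counter(grid[word] for word in text.lower().split() if word in grid)
    return [cell for cell, count in pennies.most_common() if count >= min_pennies]

case_text = "breach of contract and damages for breach"
print(concept_cells(case_text, GRID))
```

A document is then indexed by the cells returned rather than by the words
themselves, which is what makes the scheme one of assignment rather than
extraction. Lowering `min_pennies` widens the concept boundaries, in the
spirit of the adjustable boundaries mentioned for searching.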



7. PROBLEMS OF EVALUATION 

We have noted, in the introduction to this report, that several fundamental and 
highly controversial questions can be raised with respect to the feasibility and evaluation 
of any automatic indexing scheme and with respect to the evaluation of any indexing 
systems whatsoever. Yet if automatic indexing procedures are to be based upon previous 
human indexing or if their results are to be compared with human results, then the 
questions of the quality, the reliability and the consistency of human indexing are crucial 
ones indeed. Thus, Solomonoff warns: 

"The finding of exact languages for retrieval is also made less likely, in view 
of the fact that the categorizations of documents that are presented to the machine 
as a training sequence will not be performed altogether consistently by the human 
cataloger." 2/

Montgomery and Swanson ask whether human indexers are in fact self-consistent and
consistent with each other, and they suggest:

"If the answer turns out to be 'no', we might reasonably conclude that the only
reliable and effective kind of human indexing is that which is already machine-
like in nature." 1/

With a few noteworthy exceptions, there has been very little serious investigation of these 
problems and there is very little comparative data. 

O'Connor has been making a series of studies, with considerable emphasis upon how
one might measure the products of machine indexing and how one might derive machine
rules for automatic indexing from systematic review of documents indexed by people.
Cleverdon and his associates at the ASLIB Cranfield project have extensively tested
several different indexing procedures. Painter, MacMillan and Welt, Slamecka and
Zunde, and others report findings on intra-indexer and inter-indexer consistency --
unfortunately, on the basis of quite small samples. Various alternate approaches to the
evaluation of automatic indexing results have been considered by Borko, Doyle, Swanson,
Savage, Giuliano, and others. In addition, some data bearing on these questions have
been reported in connection with analyses of selective dissemination (SDI) systems.

Some data from other sources, such as studies of user preferences with respect to
various reference and search tools, are also pertinent.

The most generally accepted criterion for appraising the effectiveness of indexing
is that of retrieval effectiveness. But, in general, this is merely the substitution of
one intangible for another, entailing a string of as yet unanswerable or at least
unresolved questions. 2/ Retrieval of what, for whom, and when? How can effectiveness be
measured except by the elusive question of relevance judgments? How can human judg-
ments of relevance and value be measured and quantified?

We shall try to distinguish here, insofar as possible, between the core problems
that make the evaluation of indexing as such an extremely difficult task, the available
data on human indexer reliability, and the possible advantages and disadvantages of
automatic indexing techniques.



1/ Montgomery and Swanson, 1962 [421], p. 366.

2/ Compare Swanson, 1960 [582], pp. 2-3: "The performance of retrieval experi-
   ments when relevance judgments per se cannot be consistently assessed by human
   judgment would seem to represent overly vigorous pursuit of a solution before
   identifying the problem." Similarly, see Black, 1963 [64], p. 14: "Finally,
   when one is faced with an existing collection of indexed materials, how does one
   assess the effectiveness of any retrieval system? Suppose that one receives 20
   documents as a result of a query to the system. Suppose further that all 20 docu-
   ments are quite pertinent to the topic of interest. Is there any way to assess the
   amount of pertinent information still unretrieved from the file? Or is there any
   way of learning whether the retrieved information is more pertinent than the un-
   retrieved information? The answer is 'No!' -- the use of any retrieval system
   is, then, an act of faith in the quality of indexing."

7.1 Core Problems



First and foremost of the core problems implicit in the question of evaluation of any
indexing scheme, whether applied by man, machine, or man-machine combinations, are
those of interpersonal communication itself, which in turn relate to fundamental problems
of epistemology. These are, first, the problems of language as a means of com-
municating perceptions, apperceptions of relationships between present observations and
prior experience, and value judgments based thereon, and, secondly, even more funda-
mentally, the question of the veridicality of language representations of real transactions
and events. Serious investigators in the field, including many who have themselves con-
tributed to automatic indexing techniques, have made such typical acknowledgments of the
difficulties as the following:

"The imprecision connected with discussion of retrieval effectiveness and of
relevance is not due to lack of understanding of the relatively straightforward
retrieval processes, but is due to our lack of basic understanding about language,
meaning and human communication itself." 1/

"Fundamentally, the study of inquiry procedures is a problem in the general
psychology of cognitive functioning. Relevant problems concern the way
problems are recognized and formulated into questions, the way a search plan
is developed to find answers to questions, and finally, the way it is decided
whether or not a possible answer matches the specifications of a question." 2/

A second core problem is the heterogeneous and somewhat arbitrary development of
natural languages themselves. It is much the same fundamental problem whether men or
machines are to read text and determine the "meaning" (at least, in the sense of com-
munication intent) of messages expressed in a natural language. However, the problems
are aggravated if men themselves must know enough about language and its conveyances
of message content to specify precisely to a machine what it is to look for and to use.

Salton enumerates some of these difficulties as follows:

"No well-defined set of rules is known by which the individual words in the
language are combined into meaningful word groups or sentences. Specifically,
the correct identification of the meaning of word groups depends at least in part
on the proper recognition of syntactic and semantic ambiguities, on the correct
interpretation of homographs, on the recognition of semantic equivalences, on
the detection of word relations, and on a general awareness of the background
and environment of a given utterance." 3/

1/ Giuliano, 1963 [230], p. 6.
2/ Stone, 1962 [576], p. 1.
3/ Salton, 1963 [519], p. I-2.

Similarly, Baxendale states:

"We are confronted with difficulties which arise from the multiple ways in
which words and sentences are put together to convey meanings and shades
of meaning -- i.e., to represent ideas and concepts. Research into this
problem -- drawing upon psychological and logical analysis -- is scarcely
begun." 1/

A third core problem is the proper choice of appropriate selection criteria if
condensed representations of document content must be used for scanning, search, and
relevance decisions. Swanson suggests that the price paid for brevity of representation
so that searching operations can be efficiently managed is the loss of at least some,
perhaps most, of the information in a collection or library. He notes also that:

"It is another obvious but seldom remarked fact that the extent of such
information loss for existing libraries is not only unknown but has never
been defined in measurable terms." 2/

This loss is lived with, today, in many practical situations involving abstracts, index
term sets, selective-dissemination notices, and even mere author-title listings in
announcement bulletins or search output products from either manual or machine
searches. Yet the sheer increase in volume of the total number of items to be covered
and of the number of items potentially responsive even to a single individual's interests
has severely stretched any individual's capacity to scan or skim, much less read, the
presumably pertinent material -- documents themselves, abstracts of other documents,
listings of documents available -- already accumulating on his desk.

Condensation, or reductive representation, becomes more and more imperative.
Concurrently, while conventional tools may be lived with, after a fashion, the sub-
stitution of machine-compiled or machine-produced alternatives may make the user's
judgment of the adequacy of the selection and condensation that much worse, even
though they give the same information in the same number of pages to be scanned,
because of such things as inferior page and line formatting, small size of type on
the page, and limitation of typography to upper case and a few other symbols.

1/ Baxendale, 1962 [42], p. 68.
2/ Swanson, 1960 [582], pp. 5-6.

A fourth problem in evaluation, therefore, is the question of whether or not the
benefit to users is worth the cost. For example, despite the arguments for concept
rather than word indexing, for assignment of labels rather than mere extraction of a few
words used by the author himself, at least some data on the use made by scientists of
various sources of information on material which might be of interest to them suggest
that subject indexes are not the most important source, nor even a major source. Herner
found, for example, that only about 16 percent of his respondents reported use of indexes
and abstracts as primary tools in literature searches. He reports, for the use of tools in
becoming aware of current sources of information, 477 of 3832 responses indicating the
use of indexing and abstracting publications as against 486 using footnotes or other cited
references, 1/ 291 using library acquisition lists, and 212 using separate bibliographies
(Herner, 1958 [265]).
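
Expressed as shares of the 3832 responses, the counts just cited can be worked
out as follows. The percentages are computed here for illustration from the
figures in the text and are not quoted from Herner.

```python
TOTAL_RESPONSES = 3832

# Counts as cited above (Herner, 1958 [265]).
responses = {
    "indexing and abstracting publications": 477,
    "footnotes or other cited references": 486,
    "library acquisition lists": 291,
    "separate bibliographies": 212,
}

# Share of all responses, in percent, rounded to one decimal place.
shares = {tool: round(100 * n / TOTAL_RESPONSES, 1) for tool, n in responses.items()}

for tool, share in shares.items():
    print(f"{share:5.1f} percent  {tool}")
```

On these figures indexing and abstracting publications account for roughly one
response in eight, slightly fewer than cited references.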

These data, and similar findings of Fishenden that 17 percent of scientists queried
considered the scanning of titles in accession lists and announcement bulletins a principal
means to find information of interest, 2/ suggest that KWIC type indexes may be adequate
for many purposes. On the other hand, the KWIC index to the U.S. Government Research
Reports made available to the public on an experimental basis through the Office of
Technical Services was discontinued after a year of subsidized operation because too few
of the users indicated willingness to pay a fee in order to have the service continued on
a subscription basis.



The evaluational problem here involves the lack of information on indexing costs,
the relatively few quantitative and objectively validated studies that have been made of
user needs, the question of whether what the user says he does or wants is what he really
wants or does, and the matter of defining "interest" for different users with differing
purposes and requirements. The concept of "interest" is taken to mean the motivations
of a particular user or group of users at a particular time, while the equally imprecise
notion of "relevance" refers to the value judgments made by the user as to the relation
of an item to his query or interest.

A final core problem, then, is the question of relevancy itself, involving
recognition that "relevancy is a comparative rather than a qualitative concept . . . (and)
. . . that a document of little relevancy in the eyes of X might well be highly relevant
in the eyes of Y." 3/ Mooers states, similarly, that:

"There is no absolute 'Relevance' of a document. It depends upon the person
and his background, the work and the date. What is not relevant today may be
relevant tomorrow." 4/



Good discusses various possible measures of 'relevance' (logical measures, frequency
measures, references to, citations of, interest measures, linguistic measures), 5/
but except for the obvious statistical criteria, the problems of how to measure relevancy
remain largely unresolved.

1/ Note that Herner's data and those of Glass and Norwood, 1958 [232], reporting
6.9 percent use of cross-citations in another paper as the method of learning
of important work as against 1.2 percent using an indexing service, appear to
reinforce the claims of those who advocate citation indexing.

2/ Fishenden, 1958 [197], p. 163.

3/ Bar-Hillel, 1959 [33], p. 4-8.4.

4/ Mooers, 1963 [423], p. 2.

5/ Good, 1958 [234], pp. 7-9.

At least some data on the variability of relevance judgments is available in reports
of the performance of an SDI (Selective Dissemination of Information) system. In such
systems, the indexing terms or tags assigned to a new item are compared with a file of
"user-profiles," that is, with a pre-prepared listing of terms or topics in which a
particular user is interested. Where the term-profile of a new item matches that of a
user, a notification of the acquisition of that item is sent to him. Barnes and Resnick
report tests of such a system in which pseudo-notifications selected randomly were
included with those produced from the matching procedure. Account was kept of which
notices were regarded by the users as meeting their interests and which were not. They
found that 58.1 percent of the non-random notifications were regarded as relevant, but
that so also were 26.8 percent of the random ones. 1/
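
The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the Barnes and Resnick tests: the function names, the representation of profiles as sets of terms, and the `min_shared` threshold are all hypothetical.

```python
def matches(item_terms, profile_terms, min_shared=1):
    """True when the item shares at least min_shared terms with the profile.

    min_shared is an assumed threshold; the source does not state one.
    """
    return len(set(item_terms) & set(profile_terms)) >= min_shared


def notify(item_terms, profiles, min_shared=1):
    """Return the (sorted) users whose profile matches the new item's terms."""
    return sorted(
        user for user, terms in profiles.items()
        if matches(item_terms, terms, min_shared)
    )


def judged_relevant_rate(judgments):
    """Fraction of notices that users judged relevant (True = relevant)."""
    return sum(judgments) / len(judgments) if judgments else 0.0
```

Computing `judged_relevant_rate` separately for matched notices and for randomly selected pseudo-notifications gives the kind of comparison reported above (58.1 percent against 26.8 percent).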

Katter comments on findings that the intersubjective agreement of typical users
with respect to value judgments of condensed representations of text is low. He
suggests:

"One source of this low intersubjective agreement among users may be that it is
often not clear what is intended by the words relevant and representative.
Considerations such as the validity of the material, its usefulness, stylistic qualities,
understandability, conceptual preferability, etc., can all enter their judgments in
unknown amounts." 2/

Corroborating evidence is available from other sources. Swanson, in his tests of
a natural language text searching technique, had first used subject matter specialists to
rate the relevance of each of the text documents to each of 50 questions. Two individuals
rated each item, and if they disagreed significantly, a third person was asked to reconcile
the difference. In spite of this, 8 percent of the cases of failure to retrieve "relevant"
documents were ascribed to incorrect initial judgments of relevance, and 15 percent of the
presumably "irrelevant" documents were finally judged to be relevant after all (Swanson,
1961 [586]). In Swanson's words: "The question of formulating criteria for judging the
relevance of any document to the motive, purpose, or intent which underlies a request for
information is profound and lies at the heart of the matter." 3/

1/ Barnes and Resnick, 1963 [36], p. 2.

2/ Katter, 1963 [308], p. 24.

3/ Swanson, 1960 [587], p. 1099.



7.2 Bases and Criteria for Evaluation of Automatic Indexing Procedures



What should the bases be for the evaluation of existing or proposed indexing systems
that rely, to a greater or lesser extent, on machine generation of the indexing or
classificatory labels? Since the evaluation of quality of indexing per se raises such fundamental
and elusive questions, can these questions be begged for the case of automatic indexing as
they are in fact for almost all manual systems? If so, the obvious bases are those of
time, cost, availability of alternative possibilities, and customer acceptance. Here again
we are faced with a dearth of objective data, even for the intercomparison of any two
manual systems.

In the two years preceding the ICSI Conference, the Program Committee openly
solicited papers that would provide comparative data for operating information systems
and that would develop and discuss criteria for the comparison of systems. 1/ Nevertheless,
of the papers received only two were responsive to this invitation: the special
case of comparing the conventional file against the inverted file approach to the searching
of chemical structure data (Miller et al, 1959 [419]), and an early report by Cleverdon
on the ASLIB Cranfield project for the intercomparison of indexing systems, under a
grant from the National Science Foundation (1959 [126]).

There had been an earlier comparative experiment, generally conceded to be the
first of its kind, 2/ in which 98 search requests were run by ASTIA personnel using a
conventional catalog and by personnel of Documentation, Inc., using a coordinated Uniterm
index. Warheit says:

"Unfortunately, the conditions of the test were very poorly designed so that,
in the final analysis, each group was the sole judge both of the scope of the
original request and of the adequacy of the bibliographies produced. The
resulting claims are of course contradictory." 3/



1/ See "Proposed Scope of Area 4," Proceedings, ICSI, 1959 [481], pp. 665-669.

2/ Compare, for example, Gull, 1956 [246], p. 329: "When one considers that a
fairly thorough search of the literature indicates that this comparison of two
reference systems is the first undertaken so far, it is not surprising that the
results reveal clerical errors and an incomplete design of the test."

3/ Warheit, 1956 [631], p. 274.



However, some of the findings are pertinent to our present questions of evaluation. 
Thus, of 492 items selected by Documentation, Inc., that ASTIA considered pertinent but
had not selected, 98 were missed by them although the proper subject heading was 
searched and the catalog card had adequate selection clues, 89 were missed because not 
all applicable subject headings were searched, 21 were missed because the original 
subject heading assignments had been inadequate, 7 were missed because neither title nor 
abstract provided indication that the report itself was pertinent to the request, and 102 
were missed "because the subject heading did not occur to the searcher or because there 
were so many cards under the subject heading that the searcher was discouraged". 1/ 
Similarly, Gull reports, of 318 items selected by ASTIA that Documentation, Inc. 
personnel considered relevant but had not themselves selected, 97 were missed because 
the searcher did not consult the proper terms. 

7.2.1 The Cranfield Project

The inauguration of the Cranfield project is itself indicative of a prior lack of
objective standards as applied to the measurement of effectiveness of information
indexing, selection and retrieval systems. 2/ Beginning in 1957, and still continuing with
respect to individual indexing devices such as synonym controls and role indicators, this
work has attempted to compare different indexing systems (e.g., UDC, Uniterm, etc.)
under different indexing conditions (e.g., type of training of indexer, length of time
allowed to index) against proposed measures of "retrieval effectiveness". These
measures are, respectively, the recall ratio, or the percentage of relevant documents
retrieved as against the total number of relevant documents known to be in the collection,
and the relevance ratio, or the percentage of relevant documents among those actually
retrieved.
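
The two measures just defined can be written out directly. The following Python sketch follows the definitions in the paragraph above; the function names and the use of sets of document identifiers are illustrative assumptions, and returning 0.0 for an empty denominator is a convention chosen here, not one taken from the Cranfield reports.

```python
def recall_ratio(retrieved, relevant):
    """Percentage of the known relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumed convention for an empty relevant set
    return 100.0 * len(set(retrieved) & relevant) / len(relevant)


def relevance_ratio(retrieved, relevant):
    """Percentage of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # assumed convention for an empty retrieved set
    return 100.0 * len(retrieved & set(relevant)) / len(retrieved)
```

For example, if documents 1 to 4 are retrieved and documents 2, 3, and 5 are the relevant ones, two of three relevant documents are recalled and two of four retrieved documents are relevant.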

In the first Cranfield tests, on 18,000 documents, it is reported that the recall ratio
ranged between 75 and 85 percent for all four indexing systems. 3/

1/ Gull, 1956 [246], p. 329.

2/ Compare, for example, Randall, 1962 [492], pp. 380-381: "Prior to 1957, the
proponents of the various indexing and classification schemes, the universal
decimal system, the alphabetic subject heading, the Uniterm system and faceted
classification touted their own system on the bases of subjective evaluation and
theoretical investigations. There were many claims and much supposition about
the relative merits and benefits . . . but there was no body of data from which an
objective evaluation could be made. . . Many observers believe that the Cranfield
study constitutes the most important work done in the field of cataloging in
recent times."

3/ Cleverdon, et al, 1964 [130], p. 87.

These results are rather better than reported by others 1/ and have been subjected to
specific criticisms although these first tests were limited to the recall of the source
documents on which the test questions were based. For non-source documents there would
of course also be questions relating to the core problem of how relevance is to be judged.
Thus Markus says:



"Despite investigations by Cleverdon in England, and by many others, there is 
today no generally accepted method of comparing the effectiveness of different 
types of indexes. The needs of index users vary so greatly that even the most 
carefully planned tests of retrieval efficiency can be challenged." 2/

Notwithstanding such criticisms, however, and in spite of the fact that the Cranfield 
tests have so far been directed principally to indexing systems applied manually, certain 
findings and conclusions reached by Cleverdon and his associates are pertinent to the 
questions of evaluating automatic indexing procedures. Examples are:

"The fact is that no indexing sleight of hand, no indexing skill, can produce a
system in which a figure for recall can be improved substantially without
weakening the over-all relevance, i.e., the number of documents that are
really relevant compared with the total number retrieved.

"The majority of the failures (60 percent) were due to inadequacies and
inaccuracies (carelessness rather than lack of knowledge) in the indexing process.
However, supplementary tests, in which the staff of outside organizations carried
out the indexing revealed that the Cranfield indexers were achieving a standard
above average. This seems to indicate a certain inevitability of human weakness
and error in the indexing process and lends some support to the many current
research projects that are investigating the feasibility of automatic indexing." 3/

7.2.2 O'Connor's Investigations 

As O'Connor has cogently observed on a number of occasions, the question of
whether or not automatic indexing is possible is not the real question. Rather, the
problem is whether or not indexing by machine is capable of producing results that are
"good enough" for retrieval purposes, raising in its turn the still more basic question of
how "good retrieval" can be evaluated.

1/ See, for example, Johnson, 1962 [300], p. 90: "The amount of meaningful
information that can be retrieved is too small. There are few available studies
on this subject. But these seem to indicate that, under some indexing schemes,
meaningful retrieval can run as low as 10 and 15 percent and that the most that
can be optimized for any of them, even under highly motivated conditions, is
around 70 percent."

2/ Markus, 1963 [394], p. 16. See also Kochen, 1963 [327], p. 12: "The
outstanding large-scale and realistic experimental work is that of Cleverdon.
Unfortunately, his results are not very decisive."

3/ Cleverdon et al, 1964 [130], pp. 86-87.

His own approach in detailed investigations has been to study an existing system
(e.g., using Merck, Sharp and Dohme data) with respect
to indexing terms such as "penicillin," "toxicity," and "mode of action." He then
attempts to define various possible machine assignment rules, and then to determine the
probable over- and under-assignments that would result from the application of these
rules.



Typical results pertinent to both questions of word-indexing evaluation and of
inter-indexer consistency showed that for 23 documents indexed under the term "toxicity," 11
did not contain the stem "toxi. . ." at all; that 17 items indexed under "penicillin" contained
the word at least once; that none of 34 randomly selected documents not indexed under
"penicillin" contained the word, but that 7 of 28 items not so indexed but selected as
probable candidates from title and other clues did contain the word (O'Connor, 1961
[447]).
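
A comparison of this kind can be illustrated with a deliberately simple rule. The Python sketch below is not O'Connor's procedure: it assumes a hypothetical rule that assigns a term whenever a given stem occurs anywhere in the text, and it tallies where that rule agrees or disagrees with the human indexing. The function names and the data layout are assumptions made for the example.

```python
def stem_rule(text, stem):
    """Hypothetical rule: assign the term if the stem occurs in the text."""
    return stem.lower() in text.lower()


def compare_with_human(documents, stem):
    """Tally agreement between the stem rule and human indexing.

    documents: iterable of (text, human_assigned) pairs, where
    human_assigned is True when a human indexer assigned the term.
    """
    counts = {"both": 0, "under": 0, "over": 0, "neither": 0}
    for text, human_assigned in documents:
        machine_assigned = stem_rule(text, stem)
        if machine_assigned and human_assigned:
            counts["both"] += 1
        elif human_assigned:
            counts["under"] += 1    # human assigned the term, the rule did not
        elif machine_assigned:
            counts["over"] += 1     # the rule assigned the term, the human did not
        else:
            counts["neither"] += 1
    return counts
```

Under such a rule, the 11 of 23 "toxicity" documents lacking the stem would be counted as under-assignments, and the 7 of 28 candidate items containing "penicillin" as over-assignments.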



Typical suggestions, comments, and conclusions made by O'Connor include the
following:

"It might be required that the mechanized indexing permit as good (or no worse)
retrieval as existing human indexing, because it is desired to free the
subject-skilled indexing personnel for other work. Or poorer retrieval (than possible
with human indexing such as is presently done of comparable material) might be
accepted from computer indexing, because poorer retrieval is better than none
and there is a shortage of subject-skilled people to do the additional indexing." 1/

"Such considerations as the following are relevant. Over-assigning can increase
input costs and storage (to an extent dependent on the storage system), but
mechanizing indexing might be worth the cost. Over-assigning might also
increase the number of irrelevant documents retrieved, but the increase might
be insignificant." 2/

". . . Suppose terms A, B, and C each correctly characterize five percent of a
ten thousand document collection, each term is overassigned to another five
percent, and over-assignment of each term occurs independently of the correct
assigning and over-assigning of the others. Then about nine documents will be
extra for the search question A & B & C." 3/

"The question of permitting some under-assigning, that is, the computer failing
to assign [a term] T to some document which should have it, is more delicate.
Human indexers sometimes underassign. If we knew the rate of underassigning
by human indexers for a term T, we might consider allowing the computer a
similar rate. However, some cases of underassigning might be more important
than others and if the computer made more important mistakes than the human
indexers, retrieval might not be 'good enough'."
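
The "about nine documents" figure in the third quotation above can be checked with a short calculation. The Python sketch below rests on one reading of the passage, stated here as an assumption: each term is assigned to ten percent of the collection (five percent correctly, five percent by over-assignment), assignments are independent across terms, and "extra" means documents retrieved by the conjunction minus those correctly characterized by all three terms. The function name is hypothetical.

```python
def expected_extra(collection_size, correct_rate, over_rate, n_terms):
    """Expected extra documents retrieved by an AND of n_terms terms,
    assuming independent assignment across terms."""
    assigned_rate = correct_rate + over_rate
    retrieved = collection_size * assigned_rate ** n_terms
    correctly_characterized = collection_size * correct_rate ** n_terms
    return retrieved - correctly_characterized


extra = expected_extra(10_000, 0.05, 0.05, 3)  # 10 retrieved less 1.25 correct
```

Under these assumptions the conjunction retrieves about 10 documents, of which about 1.25 are correctly characterized by all three terms, leaving roughly 8.75 extra, which agrees with the "about nine" of the quotation.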



1/ O'Connor, 1960 [444], p. 3.
2/ O'Connor, 1961 [448], p. 199.
3/ O'Connor, 1960 [444], p. 6.

Other typical points made by O'Connor include the possibilities that the use of
automatic indexing techniques might free trained technical people for other work, that it
might permit more indexing than is now possible with available resources, that it might
cost less, and that it might produce a better or more consistent indexing product. 1/
With respect to the latter point, however, he points out that greater consistency might not
in itself be a virtue, since the product although generated more consistently might be
relatively worthless by comparison with the inconsistent human product. 2/ Especially
pertinent to the question of judgment factors in evaluation was a comparison of the most
frequent words selected by the Luhn "auto-encoding" technique as applied to an ICSI paper
against a quasi-random word list for the same paper produced by selecting the last
non-common word on every page, and the first such word on every second page. He remarks:

"The important point of this quasi-random list for my present purposes is to
emphasize that first impressions might not be at all a good way of judging
the adequacy of an index set." 3/



7.2.3 Questions of Comparative Costs



The paucity of objective data on the effectiveness of indexing systems generally
extends to even such obvious questions as costs of indexing and time required to index.
These very questions might, in fact, be decisive with respect to choice between manual
and machine systems. It has been estimated by some that the costs of manual subject
indexing amount to close to 75 percent of the costs of operating an information selection
and retrieval system, 4/ yet very little actual data on costs has been reported in the
literature. 5/ Exceptions are, for the most part, limited to rather special cases, such
as the following examples:

1/ O'Connor, 1962 [447], p. 267.

2/ O'Connor, 1963 [443], p. 16.

3/ O'Connor, 1962 [447], p. 270.

4/ O'Connor, 1963 [442], p. 1.

5/ See, for example, A. D. Little, Inc. (1963 [23], p. 5): "Performance and cost data
on existing large documentation systems are surprisingly sparse, and cost data
have rarely included adequate overhead and depreciation accounting."

1. A total cost of less than $30,000 is reported for a 10,000 document
collection at Aeronutronic. Four man-years of effort were required.
On average, 12.6 access points were provided per document, of which
9.2 were subject-indicating descriptors chosen, with some modifications,
from the second edition of the ASTIA Thesaurus. "This favorable figure
was possible because an adequate ready-made thesaurus of indexing terms
was available and because the 'peek-a-boo' type equipment used was much
less expensive than most other devices offering comparable speed
of operation and search logic possibilities." 1/

2. "The experience of libraries that have gone through indexing using
links and role indicators and careful editing shows that indexing takes
about one-half hour per document (or $4.00) and costs an additional
$1.00 for routine processing." 2/

3. In an investigation of the comparative merits of manual indexing of 2,000
documents using the UDC classification system as against a KWIC index,
Black gives the figure of approximately $1400 for the UDC case compared
to about $600 for an in-house computer operation to produce KWIC listings,
and somewhat more for a KWIC index compiled by a service bureau. 3/



Time required to index, which directly involves cost, is reported by Cleverdon to
vary widely:

"Few reliable figures have been given for current practices, although a particularly
high figure is the 1 1/2 hours average quoted for indexing reports for the catalogue
of aerodynamic data prepared by the Nationaal Luchtvaartlaboratorium in
Holland. It appears from personal discussions that an average of 20 minutes for
a general collection of technical reports is the top limit, and this has been taken
as the maximum indexing time to be used in the project." 4/



Insofar as such meagre data is indicative, there does not appear to be any particular
cost advantage for machine-compiled and machine-generated indexing other than the
title-only KWIC indexes. Thus, Olmer and Rich report, in part:

"The program . . . lends itself to a variety of applications. One of these . . . is
estimated to cost roughly $4.00 per document for cataloguing, putting on tape,
printing and making any necessary corrections." 5/

This is for a case where the indexing (cataloging) is done manually. 



1/ Linder, 1963 [361], p. 147.

3/ Black, 1962 [65], p. 318.

For a specific proposed automatic indexing system, employing a modified version
of the Luhn word-frequency counting selection principle, Gallagher and Toomey report
that:

"For the documents in our system, we estimate that processing time will be
about 20 seconds per thousand words . . . The cost is approximately $3.50
per minute when averaged between prime and extra shift." 1/

This means that the cost of processing a 3,000-word document would be $3.50, exclusive
of the costs of keypunching the input text which, conservatively estimated, costs not less
than 1-2 cents per word. 2/ Swanson similarly assumes either that machine-usable text
is already available or that editing and keystroking efforts are separate costs in arriving
at an estimate of $1.00 per item for automatic indexing. 3/
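
The arithmetic behind the $3.50 figure, and the scale of the keypunching cost it excludes, can be laid out as a short Python sketch. The rates are those quoted above; the function names are hypothetical and the calculation is a simple proportional model, not a costing method taken from the cited reports.

```python
def machine_processing_cost(words, seconds_per_thousand_words=20.0,
                            dollars_per_minute=3.50):
    """Computer cost in dollars at the quoted processing rate and machine rate."""
    minutes = words / 1000.0 * seconds_per_thousand_words / 60.0
    return minutes * dollars_per_minute


def keypunching_cost(words, cents_per_word):
    """Keypunching cost in dollars at a given rate per word."""
    return words * cents_per_word / 100.0


processing = machine_processing_cost(3000)   # one minute of machine time
punching_low = keypunching_cost(3000, 1)     # at one cent per word
punching_high = keypunching_cost(3000, 2)    # at two cents per word
```

For a 3,000-word document the machine time comes to one minute, or $3.50, while keypunching at one to two cents per word comes to $30 to $60, so that input preparation dominates the total by roughly an order of magnitude.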

These quantitative estimates bear out the more subjective conclusions of such
investigators as Bar-Hillel, O'Connor, and others. Examples are:

"It is very likely that manual Uniterm indexing by cheap clerical labor will still,
on the average, be qualitatively superior to any kind of automatic indexing, and
it is very unlikely that the cost of automatic indexing will ever be less than this
kind of manual Uniterm indexing, unless the automatic indexing is to be of such
low quality as to totally defeat its purpose." 4/

"Most of these techniques require that the full texts of documents be in machine
readable form. At present this usually requires keypunching which is much
more expensive than a specialist's indexing efforts." 5/

1/ Gallagher and Toomey, 1963 [205], p. 52.

2/ Compare, for example, Ray, 1961 [496], p. 55; Swanson, 1962 [584], p. 470:
"The cost is roughly one or two cents per word which by standards of what is
normally spent for even the most thorough indexing and cataloging, is
exorbitant." Mersel and Smith report (1964 [415], p. 10 A) typical TRW costs
of keypunching as two cents per word for Russian technical text, and one cent
per word for English. They also cite cost figures as low as half a cent per
word at the CIA-Georgetown Keypunching Center in Frankfurt and at IBM, but
this is exclusive of overhead and computer processing (e.g., editing program)
costs, so that the one cent figure appears minimal as of today. However,
Kochen reports (1963 [327], p. 7): "While keypunching of text cost roughly one
cent/word, new means for recording spoken (and written) text using a
steno-keyboard tied to a photo-disc storing a Stenocode-English dictionary could possibly
reduce the cost to 1/3 cent per word."

3/ Swanson, 1962 [584], p. 471.

4/ Bar-Hillel, 1962 [35], p. 418.

5/ O'Connor, 1963 [443], p. 1.



7.2.4 Summary: Potential Advantages as Bases for Evaluation 



In view of the difficulties engendered by the underlying core problems, the 
criticisms that can be brought against tests of "retrieval effectiveness", the general lack 
of comparative data and standards of measurement, the question of evaluation of automatic 
indexing procedures largely reduces to the weighing of potential advantages and
disadvantages. In the case of such procedures as KWIC and citation indexing, some of these
possibilities, both pro and con, have been discussed previously. In general, suggested 
bases for evaluation reflecting operational considerations may be summarized as follows: 



1. Speed and timeliness

2. Relative economy

3. Consistency and reliability 1/

4. Elimination of the need for further human intellectual effort after
initial planning and programming has been done.

5. Providing a product that could not otherwise be obtained.

6. Ease of updating and revision of indexes so produced. 2/



From the point of view of possible operational advantages, these may be combined 
into the single criterion: 



The achievement of a more effective and more economical balance between the 
meeting of the objectives of the indexing system and the utilization of available 
resources. 



1/ Compare McCormick, 1962 [409], p. 182: "A computer is objective in its operations
and it can be repetitive. If given a certain amount of information about a document,
it is always able to index the document in a consistent manner. This consistency is
desired so as to avoid the situations where a person might index a document
differently on various occasions, or where it would be indexed differently by another
person when there appears to be no good reason for a difference." Note, however,
O'Connor's point previously mentioned (1963 [443], p. 16): "It has been argued
that mechanized indexing has the advantage of consistency. . . However this
argument by itself says very little in favor of mechanized indexing. For two humanly
produced index sets for a document which differ somewhat may both be quite useful,
though imperfect, while the index set which the same program will always reproduce
for the same document may be worthless."

2/ See, for example, Youden, 1963 [658], p. 332:
"The facility with which indexes may be updated and the ease of selecting items for
special bibliographies will result in the majority of indexes being computer produced
before many years."

However, the question of the objectives of the system brings us back full circle to the
questions of purpose in terms of particular requirements, of quality, and of how to
measure either purpose or quality. Thus we may determine that an automatic indexing
procedure produces a product at least as rapidly, at least as inexpensively, at least as
consistently as human indexing operations would, and with substantially less investment of
manpower resources. However, will this product be as useful or as "good" as the human
product?

In view of the many caveats about the present quality of indexing systems 1/ and the
lack of standards for measuring quality, 2/ it is important to recognize that we should
compare the products of automatic indexing methods "not with hand-crafted excellence, but
with the average, the routine output of the over-burdened subject analyst working with the
deficiencies of any other indexing system". 3/ Such deficiencies include the critical
question of how well and how consistently the system, whatever it is, is applied in practice
by the human analysts.

7.3 Findings with Respect to Inter-Indexer and Intra-Indexer Consistency

Very few objective studies, despite the obvious relationship to the general questions
of quality, pertinency, and reliability of indexing, have as yet been made of inter-indexer
and intra-indexer consistency. Perhaps the first investigation both to obtain experimental
data and to analyze the observed types of failures to achieve correct assignments was that
of Lilley. 4/ He took the answers made to 6 questions by 340 students entering a graduate
library school, wherein they were asked to write down the subject headings which they
would expect to be applied to other books on the same subject as 6 "sample books" in a
system such as the Library of Congress card catalog. Lilley reports:



1/ See, for example, in addition to comments by O'Connor and others previously quoted,
Helyar, 1961 [262], p. 110: "The general current of feeling of the meeting as
reflected both in the papers and in the discussion is that the standard of indexing is not
nearly adequate;" Artandi, 1963 [22], p. 1: ". . . 'Good indexing' as such has
not been defined satisfactorily and is the function of many variables, some known,
others not yet identified"; Tritschler, 1963 [610], p. 5: ". . . 'Good' indexing is
extremely difficult to describe and 'perfect' indexing is impossible to define or
measure."

2/ See Cleverdon, 1960 [124], p. 429: "The most important requirement in information
retrieval is a recognized standard of measurement and after that we need a
satisfactory method of measuring. Only when these have been found will it be possible to
know for certain whether any new system of indexing or retrieving information is an
improvement on previous methods. At present all those trying to solve the problems
of information retrieval are working very much in the dark, uncertain as to the real
problems and quite unable to apply any measurements to their proposed solutions."

3/ Kennedy, 1962 [311], p. 126.

4/ Lilley, 1954 [360]. See also Vickery, 1960 [626], p. 4.

"A total of 2245 headings were suggested, averaging 1.1004 headings per book per
student. These headings represented 373 different varieties, of which 368 were
different from the headings traced on the Library of Congress cards for the sample
books. . . As an average 62.17 different headings were suggested for each book. . .

"When the 368 different varieties of incorrect headings were analyzed in accordance
with certain criteria that had been set up, it was found that incorrect specificity was
a factor in 93.48%, incorrect terminology in 79.08% and incorrect form of entry in
72.28% of the headings. . . Over half of the incorrect headings (54.62%) had some
combination of two errors, and almost half (49.73%) could have been converted into
'correct' headings only by changing the level of specificity, and by revising the
terminology, and by altering the form. . .

"It was also found, contrary to the general assumption that failure in specificity
almost always means that the reader is approaching his subject from too broad a
point of view, that of those headings in which an incorrect level of specificity was a
factor. . . 64.82% were too broad and 35.18% were too narrow." 1/

Lilley then asks the rather plaintive question as to what would happen, given that his quite 
homogeneous group of subjects, all of them college graduates and all seriously interested 
in librarianship, could come up with more than 62 different headings, on average, for 
every heading actually used in the catalog, if his test group had included a larger number 
of subjects with more heterogeneous interests? 

In 1961, Macmillan and Welt investigated the duplicate indexing of 171 papers in a 
limited area of the medical sciences (1961 [389]). In only 18 percent of the cases was the 
indexing identical or nearly so. About a third of the papers had been indexed so differently 
that there was no common correlation. For the rest, terms were used in one case that 
were missed in the other. 

Some brief data on inter-indexer consistency is also provided by Kyle (1962 [342])
for two indexers applying her classification system to 246 arbitrarily selected French and
English items in the field of political science. Of these, 160 were indexed the same way
by both indexers, for a consistency figure of 70 percent. Tritschler noted that no items
were indexed the same way a second time as they were the first, in small-scale
experiments involving 20 documents independently indexed by 7 different people. 2/

Painter (1963 [460]), in her study of problems of duplication and consistency of 
subject indexing of the reports handled by the Office of Technical Services, proceeded by 
selecting items from the announcement bulletins of agencies contributing to OTS, having 
these items re-indexed in the various agencies, and comparing the results with the origi-
nal indexing assignments. At ASTIA, 94 items were re-indexed, with 1,239 terms having
been assigned to them originally and 1,119 assigned on the re-run. Overall, 62 percent of
those terms originally assigned were also assigned the second time, and 69 percent of the
second-time terms had also been assigned originally. However, 111 of the starred
descriptors (which are of the most significance in the ASTIA system) were used the first time
and not the second, while 98 were used the second time but not the first.
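
The two percentages reported for ASTIA are directional: the share of first-run terms that were assigned again, and the share of second-run terms that had been assigned before. A minimal sketch in Python, assuming each run is available as a collection of terms per item (the function name and the sample terms are illustrative and are not taken from Painter's study):

```python
def directional_consistency(first_run, second_run):
    """Return (share of first-run terms repeated, share of second-run terms seen before)."""
    first, second = set(first_run), set(second_run)
    common = first & second
    wrt_first = len(common) / len(first) if first else 0.0
    wrt_second = len(common) / len(second) if second else 0.0
    return wrt_first, wrt_second


# Hypothetical terms for a single re-indexed report.
original = {"radar", "antennas", "noise", "signal processing"}
rerun = {"radar", "antennas", "detection"}
print(directional_consistency(original, rerun))
```

Reporting both directions matters because the two runs rarely assign the same number of terms, so a single figure can hide which run was the more generous.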



1/ Lilley, 1954 [360], pp. 42 and 43.
2/ Tritschler, 1963 [610], p. 5.

At AEC, 96 items were re-indexed to the subject heading scheme used in Nuclear
Science Abstracts. There had been 249 headings assigned to these items originally and
406 were assigned on the second run, for an overall consistency rate of 54 percent, but
with 53 percent of the headings used the second time not having been used the first. The
sample checked at OTS consisted of 32 items to which 346 descriptors had been assigned
the first time and 418 the second. The consistency was 65 percent with respect to the first
run and 54 percent with respect to the second. Finally, at the National Agriculture Library
99 items were checked, with results showing a high consistency rating and a similarity of
indexing between the two runs of 86 percent. Painter concludes:

"The consistency rates are not encouraging. Apparently there is little difference 
between preparation for a manual system and that for a machine system. The per-
centages indicate that there is no significant difference between consistency where
two or three headings are assigned and where twelve or sixteen are assigned.
Therefore, we are left with the fact that regardless of these variables, consistency
rates range between 60 and 72 per cent." 1/

Jacoby and Slamecka report even less encouraging data (1962 [293]). "In general,
the inter-indexer reliability was found to be low (in the vicinity of 20 per cent), the intra-
indexer reliability somewhat higher (about 50 per cent)." For a series of tests of indexing
of a group of chemical patents by three experienced and three inexperienced indexers, they
found that the beginners had average matchings among the terms assigned by them to the
same documents of only 12.6 percent and that even for the experienced indexers the
average percent of matching terms was only 16.3 percent. 2/ In other studies, these in-
vestigators have explored the effects of various indexing aids upon the reliability and
consistency of indexing, concluding that the use of prescriptive aids such as authority lists
improves reliability and inter-indexer consistency from 8 or 9 percent to 33 percent, while
those aids such as thesauri and association lists "which enlarge the indexer's semantic
freedom of term choice" are detrimental (Slamecka and Jacoby, 1963 [560]).

Rodgers in a study of intra-indexer consistency reports data for the re-indexing, by
the same person at a later date, of 60 documents dealing with the United Arab Republic
taken from The New York Times. She reports that the average consistency over all 60
documents was 59 percent. 3/ In a further study of inter-indexer consistency, 20 papers
from Area 5, ICSI, were key-word indexed by 16 people, all of whom were familiar with
the subject matter (although only 8 completed all 20 papers). Results are given in terms
of the proportions of the total number of unique words chosen by 100 percent of the subjects
(.008), half of them (.14), and only one of them (.52). 4/ Study of the results in terms of
the proportion of words selected in common by any pair of these indexers to the total
number of different words selected by them both gave a "grand mean agreement for all
two-person combinations for the 8 subjects. . . [of]. . 24 percent against all 20 articles." 5/
The mean percentage of overlap between Luhn's word-frequency selection technique (as
applied to the same papers) and any one or more indexers who agreed was .15.
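
The pairwise measure described here (words selected in common by two indexers, divided by the total number of different words selected by both) can be sketched as follows. This is only an illustration under the assumption that each indexer's selection for one paper is held as a set of words; the function names are hypothetical and the averaging shown is one plausible reading of the "grand mean" over all two-person combinations.

```python
from itertools import combinations


def pair_agreement(words_a, words_b):
    """Words chosen by both indexers divided by all different words chosen by either."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def grand_mean_agreement(selections):
    """Mean pairwise agreement over every two-person combination of indexers."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("at least two indexers are needed")
    return sum(pair_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical key-word selections by three indexers for the same paper.
selections = [
    {"indexing", "consistency", "thesaurus"},
    {"indexing", "consistency", "retrieval"},
    {"indexing", "vocabulary"},
]
print(round(grand_mean_agreement(selections), 3))
```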



1/ Painter, 1963 [460], p. 94.

2/ Jacoby and Slamecka, 1962 [293], p. 16.

3/ Rodgers, 1961 [504], p. 12.

4/ Rodgers, 1961 [503], p. 50.

5/ Greer, 1963 [239], p. 10.

Still further studies of indexer consistency investigated at the Information Systems
Operation division of General Electric have just recently been reported (Korotkin and
Oliver, 1964 [331, 332]). In particular, the investigators report on the effects of subject
matter familiarity and on the use as a job aid of a reference list of suggested descriptors
upon inter-indexer consistency. The material for test consisted of 30 abstracts drawn
from Psychological Abstracts, to be indexed by 5 psychologists and 5 non-psychologists in
two sessions, with and without use of the "job aid". Results in terms of mean percent
consistency were reported as follows:

                              Session I     Session II

"Group A (Familiar)             39.           53.0%

 Group B (Non-familiar)         36.4%         54.0%" 1/

Corroborating evidence of a generally low rate of inter-indexer consistency is
provided by noting instances of duplicated indexing that may occur in regularly issued
announcement bulletins. During current awareness scanning of the DDC (ASTIA) "TAB"
in recent months, members of the staff of the Research Information Center and Advisory
Service on Information Processing have caught more than 20 cases of duplicate and even
triplicate indexing of the same item. (Two examples can be discovered in Figure 8 a and
b.) For the 52 independent assignments involved for these items, the average inter-
indexer consistency is only 46.1 percent.

On the general subject of indexing consistency, Black comments as follows:

"There have been enough experiments to indicate that there is no consistency, or
very little, between one indexing performance by a given individual and another
indexing performance, at a later date, by the same individual. The same inconsis-
tency has been discovered among different individuals all indexing the same docu-
ments. Thus there is neither inter-indexer consistency nor intra-indexer consis-
tency in any system that depends on human performance." 2/

There can be little doubt that the quality and consistency of most human indexing, 
practically available today, is not good. Much of it, because of time and other pressures, 
is either directly a word-extraction process, or it is inconsistent in assignment of many
relevant descriptors and subject category labels. On the other hand, today's indexing, 
whether accomplished by man or machine, is probably no better and no worse than any 
other classificatory or indexing procedures. The only excuse, therefore, for choice 
between man and machine is the cost/benefit ratio which is related on the one hand to 
specific operational considerations and on the other to the question of whether or not 
various indexers, and various users, would agree with the machine as much as they agree 
with each other. 

Before turning to some of the operational considerations affecting the cost-benefit
ratio, however, certain special factors should be briefly mentioned. 

7.4 Special Factors and Other Suggested Bases for Evaluation

The difficulties and problems of evaluation so far considered are generally applicable
to any indexing system, whether manual or automatic. Certain special factors arise, how-
ever, when we consider some of the proposed automatic assignment and automatic classi-
fication techniques. In addition, the prospects for computer processing hold at least the

1/ Korotkin and Oliver, 1964 [331], p. 7.

2/ Black, 1963 [64], pp. 16-17. 
[Figure 8 a and b. Examples of duplicate indexing of the same item in the DDC (ASTIA) TAB.]

promise of more objective measures of performance or quality than evaluative techniques
available today. 

Examples of the special factors involved in assignment indexing techniques and
automatic classification include the question of the amount of computation required in the
inversion and other manipulations of large matrices 1/ and the concomitant problems of
how large a vocabulary of clue words can be used effectively and of whether some docu-
ments cannot be indexed at all because they contain none of these words. 2/ There is, as
Needham says, "no merit in a classification program which can only be applied to a couple
of hundred objects." 3/
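
The coverage question raised here, whether an item contains any clue word at all, is easy to check before assignment is attempted. The following is a rough sketch only; the tokenization rule, the function names, and the sample vocabulary are assumptions made for illustration and do not reproduce the procedures of Maron or Borko.

```python
import re


def clue_words_found(text, clue_words):
    """Return the clue words that actually occur in the text (case-insensitive)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & set(clue_words)


def unindexable_items(documents, clue_words):
    """List identifiers of items that contain no clue word and so cannot be assigned."""
    return [doc_id for doc_id, text in documents.items()
            if not clue_words_found(text, clue_words)]


# Hypothetical clue-word vocabulary and two short items.
clues = {"reactor", "neutron", "shielding"}
docs = {
    "item-1": "Neutron flux measurements in a research reactor",
    "item-2": "Tentative adverbial function classes were assumed",
}
print(unindexable_items(docs, clues))
```

The proportion of items returned by such a check gives one simple indication of whether a clue-word list is large enough for the collection at hand.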

In the various techniques for automatic clustering or categorization of documents, 
there are serious questions of whether the groupings can be conveniently named or dis- 
played for the benefit of the user. 4/ Another example of special factors in the appraisal 
of an automatically generated classification scheme is as follows: 

"Operational testing is displeasing in that it puts off any verification until right at the 
end; it is expensive; there is not much experience on how to do it in a realistic way; 
and it is ill-controlled in the sense that the practical performance of a system is
influenced by many other factors than the classification it embodies." 5/

Examples of suggested bases for evaluation made possible by machine processing 
itself include proposals by Doyle and Garvin, among others. Doyle in particular suggests 
the substitution for the elusive concept of "relevance" of criteria based on "sharpness of 
separation of exploratory regions in which the searcher finds documents of interest from 
those in which he does not find such documents." 6/ He further emphasizes the need for
discriminating a particular document from other topically close documents (Doyle, 1961
[166]) and suggests that "this decision can never be made by a human---only by a com-
puter, which is the only agency capable of having full consciousness of the contents of a
library." 7/ Garvin considers the more general problems of language and meaning, and
suggests that there are two kinds of "observable and operationally tractable manifestations
of linguistic meaning", namely, translation and paraphrase, and that these may be
investigated by techniques of linguistic data processing. 8/ Edmundson, however, points
out that while there is in general only one translation of a document, there may be as many 
abstracts (and, by implication, index sets) as there are users. 9/ Thus we are back again 
at the questions of purpose and relevance. 

1/ Compare Williams, 1963 [642], p. 162.

2/ See Maron and Borko, various references.

3/ Needham, 1963 [433], p. 8.

4/ See, for example, Doyle, 1963 [162], p. 6: "Several researchers have tried to
group topically close articles, usually by statistical means, but it is rather difficult
to get any benefit from this grouping unless you can represent these groups for
human inspection."

5/ Needham, 1963 [432], p. 2.

6/ Doyle, 1963 [164], p. 200.

7/ Doyle, 1961 [169], p. 23.

8/ Garvin, 1961 [224], p. 137.

9/ Edmundson, 1962 [178], p. 4.

8. OPERATIONAL CONSIDERATIONS



Whatever the verdict of evaluation of one or more automatic indexing techniques,
whether of the derivative, modified derivative, or assignment type, there are certain
operational considerations and problems that typically affect any attempt to apply such
techniques in actual production operations. These considerations, which also affect lin-
guistic data processing operations in general, include input considerations, availability of
methods or devices for converting text to machine-usable form, programming consider-
ations, questions of format and content of output, and problems of customer acceptance of
the machine products.

8.1 Questions of Input

Input considerations include, first, questions of the extent and availability of mate- 
rial which can be handled directly by the machine. This may be limited to title only, to 
title plus abstract, title plus other material, 1/ preselected text or automatically gener-
ated extracts; or it may in a few cases extend to full running text. Possible future re-
quirements may extend to the processing not only of full text but of interspersed graphic 
material (equations, charts, diagrams, drawings, photographs) as well. 

We have considered typical arguments for and against the limitation of input to titles
only, to augmented titles, and to abstracts in other sections of this report. The points to
be emphasized here are requirements for pre-editing or post-editing, provisions for error
detection and error correction, the time and cost requirements of conversion equipment if
material is not already available in machine-usable form, and the like. As Cornelius
suggests:

"Present day computers, if used for machine indexing, will be generally input
limited and will require excessive data preparation. Causes of these limitations
are: time required for translation to machine language, verification of this ma-
chine language, and the capability or lack of capability of correction in the input
media." 2/

Examples of pre-editing requirements, even for the simple case of keyword-in-title
indexing, include the spelling out of chemical symbols, the encoding or the omission
of subscripts and superscripts, insertions of hyphens to prevent indexing of a word, and
substitutions of blanks for hyphens in compound words to assure indexing of each com-
ponent. 3/ For full text, a far more extensive and elaborate set of rules and conventions
must be developed and applied. 4/ Other editing may be required for format standard-
1/ This may specifically include cited titles, as suggested variously by Bohnert, 1962
[69], p. 19; Giuliano and Jones, 1962 [229], p. 10; Swanson, 1963 [580], p. 1;
Gallagher and Toomey, 1963 [205], p. 53; and as used in the SADSACT method, see
pp. 98-99 of this report.

2/ Cornelius, 1962 [140], p. 42.

3/ See, for example, Kennedy, 1961 [311], p. 120.

4/ See, for example, the sophisticated proposals of Nugent, 1959 [441], and Newman
et al, 1960 [439].

ization, especially in the case of citation indexes compiled by machine. 1/ O'Connor notes,
however, that "the provision of pre-editing information can slow down the keypuncher or
typist, increase the chance of mistakes, and require more intelligence or training on the
typist's part." 2/

Questions of error detection and error correction apply both to the original text and
to transcribed versions if these are necessary. That is, the basic documents themselves
may contain typographical errors, misspellings, and the like, and additional errors are
bound to occur at all subsequent stages requiring human processing. Wyllys discusses
the need for the correction of spelling errors, mentions suggested computer programs for
detection, and cites a private communication from Stiles suggesting that the criteria for
accepting words as valid be either that they are identified as already being in the system
vocabulary or that they occur at least twice in the input item. 3/
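
The acceptance rule attributed to Stiles (a word is valid if it is already in the system vocabulary or occurs at least twice in the input item) can be sketched in a few lines. The tokenization, the function name, and the sample vocabulary below are assumptions for illustration; nothing here is taken from an actual program described by Wyllys or Stiles.

```python
import re
from collections import Counter


def screen_words(item_text, system_vocabulary, min_count=2):
    """Split the words of an item into (accepted, suspect) sets.

    A word is accepted if it is in the system vocabulary or occurs at
    least `min_count` times in the item; all other words are suspect.
    """
    counts = Counter(re.findall(r"[a-z]+", item_text.lower()))
    accepted = {w for w, n in counts.items()
                if w in system_vocabulary or n >= min_count}
    return accepted, set(counts) - accepted


# Hypothetical vocabulary and an item containing two keying errors.
vocabulary = {"the", "reactor", "design", "and", "shielding"}
text = "The reactor coer design. Reactor shielding and sheilding; plasma plasma."
accepted, suspect = screen_words(text, vocabulary)
print(sorted(suspect))
```

Note that the second criterion admits a new word ("plasma" above) only because it is repeated, so a misspelling keyed the same way twice would also pass.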

Swanson's analysis of the reasons for retrieving irrelevant, and failing to retrieve
relevant, material in the case of text searching on the nuclear physics abstracts includes
typical data on the effect of errors. 4/ He found, for example, that failures to record
hyphenated words, subscripts, superscripts and other special symbols accounted for about
5 percent of failures to retrieve relevant items, and errors in transcription of either text
or search instructions accounted for another 3 percent of these failures. Errors in key-
punching of the search requests alone accounted for 4 percent of the cases of irrelevant
retrievals. By contrast, in the newspaper clippings experiments where the input material
was already in machine-usable form, transcription errors were not a factor but the input
tape itself had many errors. In this special case, however, Swanson reports: "Garbles
are not important simply because messages are sufficiently redundant to insure that even
if one or two keywords for a given category are garbled, almost invariably others are
present." 5/

The news clippings material used by Swanson represents one class of materials that
are today initially available in machine-usable form, because the original recording of the
message or text resulted in a machine-usable medium, such as punched paper tape. A
punched paper tape is produced as the product of many typesetting operations, especially
for newspaper and magazine publication, and this will be increasingly true in the future,
together with computer-prepared tapes for input to automatic typographic composing
equipment. To date, however, equipment to convert from these tapes to the particular
machine language of a given computer processing system is largely non-available, is
costly, and is highly subject to error. 6/



1/ See, for example, Atherton, 1962 [25], p. 4; Marthaler, 1963 [399], p. 22.
However, at least one computer program has been developed to assist in this pro-
cess. See Thompson, 1963 [600], p. II-1: "The present program takes biblio-
graphic citations and automatically arranges them into a standard format in such a
way that the various parts of the citation are unambiguously identified. These
standardized citations can later be processed by sorting and matching procedures to
identify similar citations and to effect various rearrangements."

2/ O'Connor, 1960 [444], p. 8.

3/ Wyllys, 1963 [653], p. 15.

4/ Swanson, 1961 [586], Appendix.

5/ Swanson, 1963 [580], p. 5.

6/ Compare, for example, Savage, 1958 [521], p. 11: "The use of tape as the
original input to the process has offered a number of problems which have yet to be
solved. One is the occurrence of typographical errors."

Moreover, to date, very little material in the scientific and technical literature is
available in this form. As of 1961, it was reported that a survey by McGraw-Hill indicated
that only about 2 or 3 percent of the publications in the United States were then prepared by
typesetting tape, that most of this was in the form of Monotype tape which because of its
30-column width and special format is not generally compatible with tape reading equip-
ment, and that tapes had many errors in them which would require considerable effort to
correct. As of late 1963, Bennett reports:

"Computer processing of natural language text material requires that a body of data 
be available in machine-readable form. At present such a body of data results only
from a direct human copying process. An inquiry into existing transcriptions of
text which were machine-readable showed that they were abbreviated both in terms
of completeness and in number of symbols represented. As an alternative text pro-
duced as a by-product of typesetting operations is clearly an eventual possibility,
but present practices make the detection of unit delimiters such as ends-of-sentences
difficult." 2/

In the future, both machine-usable text from publishers and printers and the similar-
ly machine-usable paper tape produced as a byproduct from the original keystroking of
manuscript on such equipment as Flexowriters and Justowriters may alleviate this problem
for new items. Nevertheless, the wealth of the world's present literature, the informal
and unpublished technical reports of high current interest but limited initial distribution,
and material acquired from foreign sources, will continue to pose for the foreseeable
future major problems either of automatic reading of the printed page or of human re-
transcription at high cost.

While there have been many promising developments in automatic character recog-
nition techniques, the devices that are now available for production use are limited to
small character sets, such as a single alphabet in a single font, often of special design.
The multi-font page reader is not only not yet commercially available but may not become
so for some years to come. Even if it were, there are many unresolved and as yet in-
completely specified problems involved in the development of suitable rules for the machine
so that it can distinguish between title or page number and text, figure caption and text,
author's name in a cited reference and the title of the paper cited, and the like. A case in
point, not only for automatic reading equipment of the future but for machine processing
of machine-usable material available today, is the difficulty of machine recognition of
punctuation marks as used for different purposes. 3/

In the absence, then, both of scientific and technical documents already in machine
language form and of character recognition equipment capable of reading the printed page,
we are left with the unsatisfactory situation of re-transcribing input material either by
use of a tape typewriter or by keypunching to punched cards. That this situation is un-
satisfactory and is a major bottleneck in machine processing of text in excess of the
bibliographic citation data only is evidenced by such typical statements as these:



1/ Cornelius, 1962 [140], p. 47.
2/ Bennett, 1963 [50], p. 141.

3/ See Bennett quotation above; Luhn, 1959 [384], p. 22, and Coyaud, 1963 [143].

"The expense of transcribing such documents in their entirety will be justifiable to
a limited extent only and it may, therefore, be assumed that automatic processing
will be mainly applied to future literature." 1/

"As long as we are limited to using the equipment that is available now, the pre- 
paration of data for input will be an expensive procedure and a major cost factor in 
automatic processing of natural language." 2/

"... In a discussion of indexing by machine, we must recognize the preparation of 
input to the system as the major item of cost of operation." 3/

"Present inability to read documents automatically would make it necessary to punch 
cards or tapes, an operation likely to be even more expensive than reading by 
humans." 4/

In addition to the high costs of manual retranscription, it is also noted that keypunching
"tends to undermine the purpose of natural text retrieval by requiring human effort at the
input end of the process." 5/

In particular, keypunching or keystroking requirements undermine the purposes of 
rapid indexing as well as filing for retrieval by virtue of the time required to transcribe 
text. Horty and Walsh report, for example: 

"Flexowriter operators can produce between 1400 and 1800 lines per day of statutory 
text. Keypunch operators used in previous experiments could punch approximately 
100 lines per hour of alphabetic materials, but could not maintain this rate for a 
sustained period of time." 6/

Thus, until such time as more versatile character recognition equipment is available, 
even some of the most ardent advocates of full text processing are forced to the use of 
considerably less than full text for other than research purposes. Swanson comments, 
for example: 

". . . One must note that the manual recording of text may be exorbitantly expensive. 
If so, a judicious selection process may permit a reasonable compromise between 
the expense of input and the depth of indexing which results. For example, it is 
reasonable to select the title, abstract, table of contents (if any), sub-headings, and
key sentences or paragraphs." 7/



1/ Luhn, 1959 [384], p. 2. 

2/ Ray, 1961 [496], p. 51. 

3/ Howerton, 1961 [282], p. 327. 

4/ Levery, 1963 [359], p. 235. 

5/ Doyle, 1959 [168], p. 2.

6/ Horty and Walsh, 1963 [280], p. 259.
7/ Swanson, 1963 [580], p. 1.

"Costs come much more into line if we make available to the machine something on
the order of one per cent of the full text. Then, of course, the problem of selecting
that one per cent presents itself." 1/

8.2 Examples of Processing Considerations

A second major area of operational considerations involves the machine processing
problems, given a specified input. For most of the automatic derivative, and modified or
normalized derivative, schemes, this is primarily a question of the limitations of machine
language to a vocabulary of, typically, no more than 64 distinct characters for input,
internal manipulation, and output. In addition, the limited number of characters that can
be packed into a single machine-word complicates internal processing, storage, file look-
up (i.e., against exclusion or inclusion lists), and sorting operations.

Arbitrary truncation of text words to, say, 6 characters per word, leads to certain
computer processing or storage economies. However, it leads also to complications in
the selection of words either to be included (clue word lists) or excluded (stop lists) in
many of the proposed methods both for derivative and for assignment indexing. Additional
problems of artificial homography are created. Obvious examples are "Probab-le, -ility";
"Condit-ion, -ional"; "Freque-nt, -ntly, -ncy"; "Commun-ity, -ication, -al"; and the like.
Barnes and Resnick include in their studies of the effectiveness of an SDI System 2/ the
use of 6 different truncation levels (from 4 to 9 characters). No significant differences
were found in terms of the number of hits (matches of a new item to a user's profile which
he considered to be of definite interest to him) but there were significant differences in the
number of notifications sent him, as presumably matching his interest, and the amount of
"trash" (irrelevant items) among these notifications.

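The artificial homography produced by truncation can be made visible by grouping words on their truncated form. The sketch below is illustrative only: the function name and word list are assumptions, and it does not reproduce the Barnes and Resnick procedure; it merely shows how the number of collisions changes with the truncation level.

```python
from collections import defaultdict


def truncation_collisions(words, length=6):
    """Map each truncated form to the distinct words it stands for, keeping only collisions."""
    groups = defaultdict(set)
    for word in words:
        groups[word.lower()[:length]].add(word.lower())
    return {stem: sorted(forms) for stem, forms in groups.items() if len(forms) > 1}


# Word list built from the examples in the text, plus one word that does not collide.
words = ["probable", "probability", "condition", "conditional",
         "frequent", "frequency", "community", "communication", "index"]
for level in (4, 6, 9):
    print(level, truncation_collisions(words, level))
```

Comparing the output at several levels gives a quick view of the trade-off between storage economy and the number of distinct words confounded under one truncated form.
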
The importance of the selection criteria in derivative indexing, operationally con- 
sidered, is largely a matter of the length and the contents of the stop lists. Variability in 
practice among the various producers of KWIC indexes has previously been noted, 3/ but 
there are some interrelated and interlocking factors which affect the quality, the costs, 
and the customer acceptance of this type of machine-generated index. First, the number
of pages in a printed index is directly related to the total costs of producing that index. 4/
The amount of material covered on a single page can be increased by photographic or other
type of reduction (e.g., the 96 lines per page of the Bell Laboratories KWIC program out-
put are reduced by xerography to 62 percent of the machine output page size) (Kennedy,
1961 [311]), but the reduction must not be such as to exceed reasonable limits of legibility.

This, in turn, means that the number of entries generated for each title (obviously,
a function of the words that survive stop list purging) needs to be held to a reasonable
minimum. Thus:

"One of the major limitations of the published index stems from the conflict between 
the quantity of text that must be placed between the covers and the capacity of the 
printed page to handle it. The size of the page and the legibility of the printing 
determines the maximum density of characters which can be read without special 
aids." 5/
1/ Swanson, 1962 [584], pp. 470-471.

2/ Barnes and Resnick, 1963 [36]. See also p. 148 of this report.

3/ See discussion, pp. 65-66.

4/ See Markus, 1963 [394], p. 16.

5/ Taine, 1961 [592], p. 153.

The question of stop list effectiveness therefore becomes an operational factor as well as
one that may affect the quality and acceptability of the product. On the other hand, too
generous a purging of the input titles may of course reduce the utility of the title index by
the elimination of too many potential access points and, in particular, many that users
may be most tempted to look for.
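
The dependence of index size on the stop list can be illustrated with a deliberately simplified KWIC generator. This is a sketch under stated assumptions: it produces one rotated entry per surviving title word and ignores the fixed line length, chopping, and wrap-around of real KWIC programs; the function name and both stop lists are hypothetical.

```python
def kwic_entries(title, stop_words):
    """Return one (keyword, rotated title) pair per title word that survives the stop list."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        keyword = word.strip(".,;:()").lower()
        if not keyword or keyword in stop_words:
            continue
        # Rotate the title so the keyword leads; "/" marks the original start of the title.
        rotated = " ".join(words[i:] + ["/"] + words[:i])
        entries.append((keyword, rotated))
    return sorted(entries)


title = "Determining Aspects of the Russian Verb from Context in Machine Translation"
short_list = {"of", "the", "from", "in"}
long_list = short_list | {"determining", "aspects"}
print(len(kwic_entries(title, short_list)), len(kwic_entries(title, long_list)))
```

Each word added to the stop list removes one printed line for every title containing it, which is the saving in pages, and also one access point, which is the loss in utility.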

A related problem has to do with the number of pages required because of the length
of the title line allowed in the listings. A suggestion advanced by Brandenberg (1963 [80])
is the assignment of numeric codes to the machine stop words used and the insertion of
these codes into the listed title line in the place of these presumably insignificant words.
Thus one of the KWIC entries for the title, "Determining Aspects of the Russian Verb
from Context in Machine Translation" might go from:

RMINING ASPECT OF THE CONTEXT IN MACHINE TRANSLATION. /DETE

to:

ERMINING 032 416 712 RUS CONTEXT 308 MACHINE TRANSLATION. /DET

This particular example was picked at random from a KWIC index utilizing a 103-106
character title line, but it was deliberately shortened to the 60-character line length
found in many such indexes in order to illustrate effects of chopping and wrap-around.
Coincidentally, it also illustrates some of the difficulties of designing a well-balanced
exclusion list since in this case the purged word "aspect" is apparently being used in a
technical sense rather than in the common one of "Various aspects of. . .". By accident,
this case does show rather severe "aspects" of the chopping problem in the loss also, for
this entry, of "Russian" and "verb" although they would of course be picked up in the entry
blocks for these words. Certainly, however, the claimed advantages of context checking
are not striking, even without the introduction of the numeric codes. It is true that for
excluded words longer in length than those in our example the possible conservation of the
character-space to reduce the chopping effects for the same length line may result in im-
provements. However, the replacement of, for example, "Preliminary investigations
of. . ." by numeric codes would hardly assist the user in determining quickly from the
many possible entries under ". . ." which he should select for further personal perusal.
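
The character-space arithmetic behind this observation can be sketched briefly. The code below assumes the four word-to-code pairs that can be read off the example entry above (ASPECT, OF, THE, IN); the function names are hypothetical and the sketch is not Brandenberg's program.

```python
def encode_stop_words(line, codes):
    """Replace each coded stop word in a title line by its numeric code."""
    return " ".join(codes.get(word.lower(), word) for word in line.split())


def characters_saved(line, codes):
    """Net character positions freed by the substitution (negative if the line grows)."""
    normalized = " ".join(line.split())
    return len(normalized) - len(encode_stop_words(normalized, codes))


# Codes as they appear in the example entry; any other assignment would serve as well.
codes = {"aspect": "032", "of": "416", "the": "712", "in": "308"}
line = "ASPECT OF THE CONTEXT IN MACHINE TRANSLATION"
print(encode_stop_words(line, codes))
print(characters_saved(line, codes))
```

With three-digit codes, two-letter stop words make the line longer and three-letter ones leave it unchanged, so any saving depends entirely on the longer excluded words.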

Turning to the case of automatic assignment indexing, the processing considerations 
likely to be involved in operational factors affecting the evaluation of a system are much 
less easily exemplified. Obviously, conditions that hold for research experiments on 
small (and usually, especially selected) samples do not necessarily relate to requirements 
in potential productive applications. Exceptions are the problems of the sizes of term-term 
and term-document co-occurrence correlation matrices that can be readily manipulated, 
previously mentioned, 2/ and the concurrent problems of the size, and hence the 
representativeness, of inclusion lists or clue-word vocabularies that can be accommodated. 

Both Maron and Borko found, even in their limited test samples, a certain proportion 
of new items that could not be indexed or categorized at all because these new items did 
not contain any of the clue words recognizable by the system. 3/ Due perhaps to longer 
selective clue word lists, as well as to the special nature of his items, Swanson found no 
instances, for 775 test items, of failure to assign because of lack of indicative clues in the 
input material. In the case of 60 tests against the SADSACT model, which uses approximately 
1,600 words drawn from a "teaching sample" of items previously indexed to descriptors 
(related by frequency of co-occurrence to any of 70-odd descriptors with whose 

1/ Walkowicz, 1963 [629], pp. 136 and 137. 

2/ See pp. 108 and 160 of this report. 

3/ See Maron, 1961 [395]; also Borko and Bernick, 1963 [78]. 

assignment they had co-occurred), the machine had a sufficient basis in the input material 
for the derivation of a selection-score for at least 12 descriptors for each new item. The 
items were closely similar to, though not identical with, the source items from which the 
word associations with descriptors assigned had been drawn. The sample is obviously 
critically small. Nevertheless, the possibility that extensive clue word lists, notwith- 
standing the incorporation of trivial and even erroneous associations, can be used as 
effectively as smaller, more precise, and more carefully tailored lists, but with signifi- 
cant gains in memory space or computational requirements, is suggestive. A somewhat 
related conclusion, again reflecting the effect of processing requirements, is stated by 
Needham as follows: 

"The main point to be made is that theoretical elegance must be sacrificed to computational 
possibility: there is no merit in a classification program which can only 
be applied to a couple of hundred objects." 1/ 
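
The clue-word approach described above, in which words from a teaching sample are related to descriptors by frequency of co-occurrence and a new item receives a selection score per descriptor, can be sketched roughly as follows. This is a minimal sketch and not the SADSACT formula, which is not given here: the scoring rule (a plain sum of co-occurrence counts), the function names, and the three-item teaching sample are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_associations(teaching_sample):
    """Count how often each word co-occurs with each assigned descriptor."""
    assoc = defaultdict(Counter)
    for text, descriptors in teaching_sample:
        for word in set(text.lower().split()):
            for descriptor in descriptors:
                assoc[word][descriptor] += 1
    return assoc


def selection_scores(assoc, text):
    """Sum the co-occurrence counts of every clue word found in a new item."""
    scores = Counter()
    for word in set(text.lower().split()):
        if word in assoc:
            scores.update(assoc[word])
    return scores


def assign_descriptors(assoc, text, k=3):
    """Return up to k best-scoring descriptors; an empty list means the
    item contained no recognizable clue words and cannot be indexed."""
    return [d for d, _ in selection_scores(assoc, text).most_common(k)]


# Hypothetical teaching sample: (text, previously assigned descriptors).
SAMPLE = [
    ("magnetic core memory design", ["computers"]),
    ("core memory access time", ["computers", "storage"]),
    ("russian verb translation", ["linguistics"]),
]
assoc = build_associations(SAMPLE)
print(assign_descriptors(assoc, "fast core memory"))
print(assign_descriptors(assoc, "weather forecast"))
```

The second call returns an empty list, which corresponds to the failure case reported by Maron and Borko: a new item with none of the system's clue words receives no assignment at all.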

In KWIC type derivative indexing by machine, except in terms of allowable character 
sets and word-lengths conveniently processed, the problem of appropriate programming 
languages does not arise to any serious extent. For the processing of material in research 
on natural language text, however, the choice of interpretative and compiler types of auto- 
matic programming languages may involve computational requirements which, while being 
inappropriate in a production situation, offer considerable flexibility and versatility for 
experimental purposes. Examples of special programs of this type include the use of 
Yngve's COMIT by Baxendale and Knowlton, the development and use of FEAT by Olney, 
Doyle, and others at SDC, and the use of list-processing techniques in the General Inquirer 
system. 2/ Yngve describes the use of his program as follows: 

"COMIT has also been used in the experimental work in information retrieval of 
Baxendale and Knowlton at IBM. The purpose of their COMIT program was to accept 
as input the title of a document and to produce as output, not only descriptors, but 
pairs of descriptors which are roughly of the form adjective-noun. The purpose of 
the work is to automatically generate, from document titles, retrieval words of a 
more specific nature than simply Boolean functions of the existence of certain words 
in a title." 3/ 

The FEAT program was designed originally for word and significant-word-pair 
frequency counts. Olney describes the program in part, as follows: 

"FEAT is designed to perform frequency and summary counts of words and word 
pairs occurring in its natural text input; i.e., text written in ordinary English and 
transcribed into Hollerith code according to some set of keypunching rules. To 
focus attention on the semantic aspects of word pairs rather than on their syntactic 
aspect, pairs of which one member is a function word, such as 'the', 'is', 'by', 
etc., are excluded." 

"Using a bucket list structure of the type proposed by C. J. Sheen in FN-1634, the 
program sorts each incoming word serially, constructing a list within each of 256 
buckets for good words of a given alphabetic range . . . and another list within each 
good word entry for the Doubles and Reverses which will be ordered alphabetically 



1/ Needham, 1963 [433], p. 8. 

2/ Stone, et al., various references, p. 137 of this report. 

3/ Yngve, 1962 [655], p. 26. 

on that word ... If there are four different Doubles types of which the first word is 
'external' the addresses of the four different second words form a new list which is 
linked to the entry for 'external'. Each word type occurs only once in core, and all 
word pairs of which it is a member refer to it by means of its core addresses." 

"The program could process millions of words, automatically generating frequency 
counts far larger than the Thorndike and Lorge counts, which cost many man-years, 
and in addition, FEAT would provide complete lists of word pairs (Doubles and 
Reverses), which, so far as we know, have never been counted in a sample of appreciable 
size, despite their importance for semantic analysis of text." 
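
The kind of counting Olney describes can be illustrated with a short Python sketch. It is not a reconstruction of FEAT: ordinary dictionaries stand in for the 256-bucket list structure, the function-word list is a small made-up sample, and two points the quotation leaves open are filled by assumption. A "Double" is taken to be an adjacent pair listed under its first word, and a "Reverse" the same pair listed under its second word.

```python
import re
from collections import Counter, defaultdict

# Illustrative function-word list; the list FEAT actually used is not given.
FUNCTION_WORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to"}


def count_words_and_pairs(text):
    """Return word counts plus pair counts indexed both ways.

    Pairs are adjacent words in the running text (an assumption); any pair
    with a function word as either member is excluded.
    """
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)
    doubles = defaultdict(Counter)   # first word  -> counts of following words
    reverses = defaultdict(Counter)  # second word -> counts of preceding words
    for first, second in zip(words, words[1:]):
        if first in FUNCTION_WORDS or second in FUNCTION_WORDS:
            continue
        doubles[first][second] += 1
        reverses[second][first] += 1
    return word_counts, doubles, reverses


TEXT = ("External memory stores the index by external address "
        "and external memory is fast")
word_counts, doubles, reverses = count_words_and_pairs(TEXT)
print(word_counts["external"])
print(sorted(doubles["external"].items()))
print(sorted(reverses["memory"].items()))
```

As in the quoted description, each word type is stored once and its pairs hang off that single entry. Here the sharing is done by dictionary keys rather than core addresses.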

FEAT is used, together with a modified version of the Proto-Synthex program and 
special output formatting routines, for another SDC program, the Descriptor Word Index 
Program, which produces a content-word concordance for natural language text as well 
as statistics reflecting the type of words that occur, frequencies of occurrence, and positional 
data (Olney, 1960 [457], 1961 [456]; Stone, 1962 [574]). 

The IPL-V list-processing language is used by Kochen in some of his work on sim- 
ulated concept processing by machine. Programs for accepting sentences written in a 
formal language which was constructed of names and logical predicates (inserted either 
from a console or in the form of punched cards), for updating and re-organizing a file of 
such sentences, for storing and manipulating metalinguistic sentences such as "If X is 
author of Y and Y pertains to topic Z, then X has worked on Topic Z", for interrogating 
the file, and for tracing associations between names linked through various predicates, 
have been written in this language. 1/ 
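
The metalinguistic sentence quoted above, "If X is author of Y and Y pertains to topic Z, then X has worked on Topic Z", amounts to a simple inference rule over a file of predicate sentences. The following Python sketch shows one way such a rule might be applied. It is not Kochen's IPL-V program: the tuple representation, the predicate names, and the sample names are hypothetical.

```python
def infer_worked_on(facts):
    """Apply the rule: author_of(X, Y) and pertains_to(Y, Z) => worked_on(X, Z).

    `facts` is an iterable of (predicate, subject, object) triples.
    """
    facts = list(facts)
    topics_of = {}
    for predicate, subject, obj in facts:
        if predicate == "pertains_to":
            topics_of.setdefault(subject, set()).add(obj)

    derived = set()
    for predicate, subject, obj in facts:
        if predicate == "author_of":
            for topic in topics_of.get(obj, ()):
                derived.add(("worked_on", subject, topic))
    return derived


# Hypothetical file of sentences in a formal language of names and predicates.
FACTS = [
    ("author_of", "Smith", "Paper A"),
    ("author_of", "Jones", "Paper B"),
    ("pertains_to", "Paper A", "indexing"),
    ("pertains_to", "Paper A", "classification"),
]
print(sorted(infer_worked_on(FACTS)))
```

No conclusion is drawn for "Jones", since nothing in the file says what "Paper B" pertains to.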

8.3 Output Considerations 

Turning to operational problems of output, the question of limitations of computer 
printout language to, in most cases, a single set of upper case alphabetic characters, 
numerals, and a few special symbols, 2/ is a serious factor in customer acceptance with 
respect to appearance -- format, legibility, readability. Involved here are questions previously 
mentioned. Where, in the only presently available outputs of machine-generated 
indexes, the KWIC type permuted title indexes, should the indexing access point "slot" be 
on the page? Should all or only part of the title be displayed? Should 60- or 106-character 
lines be used? More detailed discussion of these and related points is provided by, for 
example, Youden (1963 [658]), Kennedy (1962 [311]), and Brandenberg (1963 [80]). 

A separate, but related question, is how much identification, and in what form, 
should be provided for the item itself either directly as a part of the index entry or by 
cross-reference to the address of more detailed information. There seems to be quite 
general agreement that the typical user needs something more than author's name and title 



1/ Kochen, et al., 1962 [328], p. 34. 

2/ See, for example, Lipetz, 1960 [365], p. 252: "A disadvantage of keypunched cards, 
however, is the lack of capacity to record or to print other symbols than a one-case 
alphabet, one case of arabic numerals, and about a dozen punctuation marks and 
miscellaneous symbols. Citations in the scientific literature generally make use of 
a much larger number of significant symbols: multiple cases, multiple fonts, italics, 
boldface, Greek letters, mathematical symbols, etc." Note, however, that Chemical-Biological 
Activities, a digest produced by Chemical Abstracts Service, uses 
printouts of the modified IBM 1403 chain printer, using 120 characters (see Fig. 5). 

alone to guide him. 1/ However, if the full bibliographic citation, perhaps the abstract as 
well, is to be printed out by machine, the problems of limited character set are even more 
severe. This problem is today being solved, in some cases, by separate operations in- 
volving sorting and assembly of the full citations and abstracts of the items indexed, sepa- 
rately prepared, for photographic reproduction or typesetting. Hopefully, this partial 
solution will become obsolete as automatic type-composition equipment and computer-pre- 
pared typesetting techniques become more generally available. 

Operational considerations thus involve the costs, the availability, and the limitations 
of equipment now usable for machine-generated index production. Schultz and Schwartz 
report, as of October 1962: 

"There are two major bottlenecks in automated index production caused by inadequate 
equipment development at the present state-of-the-art: 

"1. There is no way of using automatic input of the printed page or the 
indexer's notes; 

"2. There is insufficient flexibility in the forms of output available for a 
computer-produced index. 

"Both of these areas are being worked on by equipment manufacturers, and an early 
solution has been promised." 2/ 

In general, operational considerations of this type do not affect the appraisal of auto- 
matic assignment indexing techniques, because these have not yet been developed to the 
point of practical application on any realistic scale. Moreover, the difficulties of problem 
definition and basic understanding of language and meaning yet remaining to be resolved 
are such that radical new advances in computer technology, associative memories, char- 
acter readers and pattern recognition devices may completely alter the picture before 
practical systems are ready for operational tests. Thus, for example, it is claimed: 

"It appears desirable to begin experimentation with automatic indexing so that solutions 
will become known by the time character recognition equipment will have passed 
the laboratory stage." 3/ 



Similarly, Doyle suggests that the "present rate of solution of the intellectual problems of 
IR is sufficiently slow that these advanced devices will be in common use long before IR 
will truly benefit from their presence", and he urges that researchers proceed as though 
such machines were already with us. 4/ 



1/ Compare, for example, Montgomery and Swanson, 1962 [421], p. 366: "This study 
suggests that indexing should be based on more than titles and that a bibliographic 
citation system should present to the requestor something more than titles"; see 
also, in addition to references cited, p. 61, footnote 1, IBM, "ACSI-matic auto-abstracting 
project ...", Vol. 3, 1961 [290], p. 89: "The use of titles in document 
searching without any additional abstract seems to lead to a high number of ... 
errors, i.e., accepting documents which should be rejected, as not enough information 
is available to judge the pertinence of documents." 

2/ Schultz and Schwartz, 1962 [531], p. 432. 

3/ Levery, 1963 [359], p. 235. 

4/ Doyle, 1961 [169], p. 3. 

9. CONCLUSION: APPRAISAL OF THE STATE OF THE ART IN AUTOMATIC INDEXING 



Notwithstanding the difficulties of evaluation we have discussed, we shall herewith 
attempt to evaluate the present state of the art in automatic indexing techniques, using such 
available criteria as seem most appropriate. First, we suggest that all of our initial 
questions, except possibly the last, can today be answered affirmatively. "Is indexing by 
machine possible at all?" To this we can answer an unequivocal "yes" in view of the many 
examples of KWIC type indexes extant and in practical use. Secondly, "Is what can be done 
by machine properly termed 'abstracting', 'indexing', or 'classifying'?" If, by definition, 
word indexing of any kind is not "properly termed ... indexing", then, as we have seen, 
automatic derivative indexing, such as KWIC, or the selection of words to serve as index 
tags based upon the frequencies of their occurrence in text, is not so either. 

The fundamental Luhn concept for indexing based on word frequencies is, as we have 
seen, straightforward: namely that, after disregarding the most frequent "common words", 
especially those that are syntactic-function words -- articles, conjunctions, prepositions, 
and the like, together with those words that occur infrequently in a given text, the remaining 
high frequency words should give a reasonable indication of what the author was writing 
"about". Critiques of the Luhn position have been made on several grounds: 

(1) Information-theoretic: that, in fact, the most information is conveyed by 
the least frequent words. 

(2) Absolute vs. relative frequencies of usage within specialized fields. 

(3) Modifications of semantic purport by contextual and syntactic associations. 

(4) Problems of synonymity and, conversely, of orthographically identical 
words. 1/ 

(5) Multi-aspect points of interest, and future need of access to material the 
author himself did not emphasize. 
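
The basic selection rule attributed to Luhn above (drop the common function words, drop the words that occur infrequently, keep the remaining high-frequency words) can be sketched in Python as follows. The common-word list, the threshold `min_count`, the cutoff `max_terms`, and the function name are assumptions chosen for illustration. They are not values taken from Luhn's work.

```python
import re
from collections import Counter

# Illustrative common-word list; a working exclusion list would be far longer.
COMMON_WORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "on", "the", "to", "with"}


def frequency_index_terms(text, min_count=2, max_terms=5):
    """Select index terms by word frequency within a single text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    # Discard infrequent words, then rank by descending count
    # (alphabetical order breaks ties so the result is reproducible).
    kept = [(w, n) for w, n in counts.items() if n >= min_count]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [w for w, _ in kept[:max_terms]]


TEXT = ("The indexing of documents by machine is possible. Machine indexing "
        "selects words from the text of the documents, and the machine counts "
        "the words.")
print(frequency_index_terms(TEXT))
```

The sketch also makes critique (4) concrete: it counts character strings only, so synonyms are tallied separately and orthographically identical words with different meanings are tallied together.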



The last point raises again the criticisms that have been made against derivative, 
extractive or "word" indexing of all types. To repeat, although such procedures may 
index "as the author himself indexed best -- in his own language", the significant points 
are (1) there may be peripheral, minor, or unrecognized aspects of his topic and incident- 
al information disclosed, of future interest to others, which the author himself is in no 
special position to recognize, and (2) notwithstanding the "author's own terminology" being 
current usage rather than the "fossilized" vocabulary of any previously established classi- 
fication or indexing scheme, this very "currency" changes from field to field and, quite 
literally, from day to day. Nevertheless, it should be re-emphasized that the validity of 
these criticisms is not limited to automatic derivative indexing as such, but rather is 
applicable against any indexing system whatsoever, manual or machine, which is so 
strictly limited to author-terminology, author-emphases, and the consideration of the 
document at hand as a self-contained entity, without regard to other documents in a collection, 
in a particular field, and without respect to specific user needs. By contrast to 
this type of limitation, more promising approaches should stress both similarities and 
differences between a new document and previously received documents, between documents 
"belonging" to some definite category, or not, and even, as responsive to a particular 
user's profile-of-interest. 



1/ See Baxendale, 1962 [42], pp. 67-68: "... resolution of orthographic ambiguities 
is a non-trivial and over-riding prerequisite for the computer processing of 
text ...", p. 67. 



Derivative indexing, whether by man or machine, is thus subject to many disadvan- 
tages. First and foremost, it is constrained by a particular individual's personal manner 
of expression of concepts in language. This limitation is controlled only by his presump- 
tive desire to communicate with some particular (more or less general, or more or less 
specialized) audience. His choices of natural language expressions, however, will be 
conditioned by at least some of the following factors: 



(1) The range and precision of his personal mastery of both general and 
specialized vocabularies for a given time, place, and specialized field 
of discourse. 

(2) His personal expectations as to the probable reactions (in the sense of 
effective communication) of his intended audience to the expressions that 
he does choose, involving all of the problems of different usages of technical 
terminology from field to field, from formal to informal presentations, 
from scholarly reviews to progress reports heavy in current 
"technese" and "fashionable words". 

(3) His habits of thought and his training in his field. 

(4) His awareness of more than one possible audience and of more than one 
point or topic of potential interest to his readers. 



Secondly, indexing by the author's own words is remarkably sensitive to a particular 
period of time, so that the terminology becomes rapidly outdated and often seriously mis- 
leading in its connotations. Thirdly, the user has no advance knowledge of the terminology 
that has been used in all the varied texts of a collection and he must therefore be able to 
predict a wide variety of possible ways of expressing ideas in words, phrases, and even 
by implication. Fourthly, for collections indexed on a word-derivative basis, there is 
little or no possibility for generic searching. 1/ Finally, there is the more general 
question, applicable to both derivative and assignment indexing, of how well, ever, can a 
condensed representation serve the purposes of specific subject content recapture? In the 
strict sense, only by the elimination of truly redundant information. But even this is a 
relative matter. What is redundant for an author may not be so for several different po- 
tential users of the reports or papers that this author writes. What is redundant for one 
user is not necessarily so for others. 



The further problem for machine techniques is therefore: how selection rules can 
be provided that will replicate a given human pattern of selectivity, or, alternatively, how 
selection rules can be established and defined that will produce an equivalent and compar- 
able result - that is, one which typical users would agree is as pertinent to their query- 
answer relevance decisions as any available alternative. 

Certainly the problem of appropriate selection is at the heart of the matter. This is 
a crucial question, even if we sort out and can specify the different uses, for a particular 
collection, a particular clientele, at a particular time, that automatically generated con- 
densed document representations may have. Wyllys, in appraising automatic abstracting 
efforts, considers that the goal should be to provide extracts which will serve a search- 
tool function -- that is, they will furnish the searcher with enough information about the 
document content so that he may decide whether it is probably pertinent to his then interests 
or not and hence decide whether or not to read the document in full. By contrast, he says 
of the "content-revelatory function" that an abstract should: "furnish the reader with 
enough information about the related document so that in most cases he will not need to 
read it itself." 2/ 



1/ See, for example, Doyle, 1963 [162], with respect to lack of capacity for generic 
searching as one of the major disadvantages of natural text search systems. 

2/ Wyllys, 1963 [653], p. 6. 

Let us recall the objections to the use of the terms "auto-encoding" (or "auto-indexing" 
or "auto-abstracting") because of the possible connotation of self-encoding, etc. 1/ 
This is an objection based upon avoiding ambiguous or misleading terminology, but it also 
points to an objection as to the principle involved -- that is, of treating the document itself, 
in its own right, as a self-sufficient, self-contained universe of discourse, and of assuming 
that some type of summation-condensation over a number of different and individually-derived 
representations of the separate documents in a collection can provide an effective 
selection-retrieval guidance system to the contents of various specific documents in that 
collection. Even when the actual operations are to be abetted by synonym reduction and 
normalization procedures (whether at the indexing or search negotiation stage, or both), 
there is a significant difference between this endogenous hypothesis and its exogenous 
alternative: that the basis for automatic indexing be the consensus of the collection, or of 
a sample of the collection, or of prior indexing. 

Assignment indexing, especially in the sense that concept-indexing is the goal, may 
be subjectively preferable to derivative indexing not only because it involves exogenous 
emphases but because it tends to delimit, centralize, and standardize the access points 
available to the user in his search-retrieval operations. However, in terms of the human 
indexing situation, it involves all the traditional difficulties of indexing - which in turn 
invoke the problems of evaluating indexing systems: 

"Justification for any indexing technique must ultimately be based on successful 
retrieval. Success can only be evaluated in terms of a closed system; that is, a 
system wherein sufficient knowledge is available of the entire contents of the 
materials, so that an evaluation can be made of various techniques as to their 
retrieval effectiveness. The various systems . . . cannot really be weighed except 
on the basis of a test comparing one against the other. This has not been done in 
any place." 2/ 

Nevertheless, there are a variety of reasons for accepting even the relatively crude 
derivative indexing products as practical tools today, for seeking machine-usable rules 
for the improvement of these products, and for continuing research efforts in automatic 
assignment indexing and automatic classification. There are, first and foremost, the 
cases where conventional indexes are inadequate or non-existent. Thus Wyllys claims: 

"It is well-known that the current methods of producing, through human efforts, 
condensed representations of documents are already hopelessly inadequate to cope 
with the present volume of scientific and technical literature. Many papers are 
never indexed or abstracted at all, and even in the cases of those that are indexed 
or abstracted, the indexes and abstracts do not become available until six months 
to two years after the publication of the paper." 3/ 

Again, with respect to automatic derivative indexing, especially KWIC indexes based 
on titles alone, there can be no question as to the evaluation criterion of timeliness. The 
success of this aspect is widely acknowledged by users, systems planners, and interested 
observers. On the other hand, there is very little reported evidence available on which 



1/ See p. 3 of this report. 

2/ Black, 1963 [64], p. 16. 

3/ Wyllys, 1961 [650], p. 6. 



any objective measure of comparative cost-benefit ratios may be obtained. Black reports, 
but without supporting data, that: 

"It has been estimated that the efficiency of KWIC indexing is about 76 per cent compared 
with about 82 per cent for conventional indexing or classification." 1/ 

White and Walsh report that: 

"From the limited experiment on methods of indexing the 1962 issues of the Abstracts 
of Computer Literature, the permuted title indexing retrieved only 52 percent of the 
information. This low percentage may be attributed to the changing and not yet 
uniformly standardized terminology existing in computer technology." 2/ 

KWIC indexes, because of their very currency, are fulfilling significant maintaining- 
awareness needs today. Improved titling practice, enforced by editorial rigor or contract- 
ual requirements or both, can improve their usefulness. They fill gaps in the bench 
scientist's or engineer’s ability to know about what might be of interest to him, either 
because the material is not otherwise covered in normal secondary publication (e.g., conferences 
and proceedings of symposia, internal technical reports not produced on Government 
contracts and therefore not announced and indexed by the cognizant agencies, and the 
like) or because the sheer bulk of the product of indexing-abstracting services in his field 
prevents his effective use of these services unless more specific access points are provided. 
The claim that "something is better than nothing" is not without merit, 3/ even 
with all the problems of non-resolution of synonymity, homography, topical scatter, long 
blocks of entries under the sorting term, the even more significant disadvantages of author-bias 
towards his principal topic, the author's choice both of emphasis and terminology, 
and the like. Williams, considering word-with-context indexes, whether limited to title 
only or to titles with readily available augmentation, makes the following comments: 

"Limitations and other troublesome features of the method have been obvious, but 
perhaps over obvious, in the light of its growing acceptance and of the basic validity 
of permitting a document to speak for itself, even in a much abstracted recapitulation. 
Wherever there are large and growing problems in maintaining publication schedules 
for established subject indexes, or wherever pressing needs develop for more fre- 
quent indexes, for rapid, low-cost cumulation, or for indexes in areas where suit- 
able indexing services are wanting, there no apology is needed for proposing that 
this method be considered and tried, as a precursor to ’better’ indexing, if not as a 
substitute. Its use may be of interest also in less troubled circumstances, in its 
own right, and because of common elements involved in its production and the provision 
of other wanted products and functions (catalog records, current-awareness, 
lists, etc.)." 4/ 

Returning to the question of whether automatic indexing is possible, it can be seen 
that, at least in the derivative indexing sense, it is not only possible but can be practically 
useful. To dismiss the evidence of automatic derivative indexing operations that are in 
production today by rigorous definition of what indexing is in effect anticipates both our 



1/ Black, 1962 [65], p. 318. 

2/ White and Walsh, 1963 [639], p. 346. 

3/ See Veilleux, 1962 [624], p. 81: "Accepting the premise that partial control of information 
satisfies more consumers than absence of control, perfection was traded 
for currency." 

4/ T.M. Williams, private communication, dated January 4, 1962. 

third and fourth questions: whether machine-generated indexes are as good as or better than 
the products of human operations, and how we can measure and appraise the adequacy of 
any indexing system whatever. Here are encountered the "core" problems of meaning in 
communication, of information loss in any reductive transformation of actual messages or 
documents, of relevance of particular messages to particular queries and to particular 
human needs, of judgments of relevance. 

Because of these underlying yet overriding questions, the state-of-the-art in the 
evaluation of indexing systems is in fact far more primitive than that of automatic indexing 
itself. An easy, and an early, solution is not likely. Therefore, today, in appraising 
machine potentials for assignment indexing we are faced with what is in effect a single 
criterion: namely, will a given group of human evaluators, whatever their standards and 
requirements, agree as much with the products of an automatic indexing procedure, otherwise 
competitive on a cost-benefit ratio with human indexing of the same material, as they 
do amongst themselves? 

Within the limits of small, specially selected samples of document or message col- 
lections, it is possible to demonstrate that: 

(1) Replication of the products of at least some existing systems, within the 
consistency levels observed for these systems, can be achieved. 

(2) Retrieval effectiveness with respect to relevant items indexed by auto- 
matic assignment procedures can be at least as good as, and may be 
superior to, that obtained from run-of-the-mill manual indexing of the 
same items. 

(3) Costs of indexing can be held at or below the costs of equivalent manual 
indexing, provided both that the input material required is already in 
machine-usable form, or can be held to an average of, say, 100 words or 
less, and that the clue-word lists, association factors, or probabilistic 
calculations can be accommodated within internal memory. 

(4) Significant gains in time required to generate an index or to index or re- 
index a collection can be achieved. 

Some degree of theoretical success in assignment indexing by machine can thus certainly 
be claimed. Moreover, many of the test results reported do clearly indicate a quality of 
indexing, for a given collection at a given level of specificity of indexing, at least com- 
parable to that which is typically and routinely achieved by people in a practical indexing 
situation. No more should be asked of the automatic techniques unless better human index- 
ing can be specified as being equally feasible, timely, and practical. Further, no more 
should be asked of automatic techniques in terms of the evaluation of their potentialities, 
than is now asked of the manually-prepared alternatives. 1/ 

Data with respect to comparison of the results of automatic assignment indexing 
techniques to either a priori or a posteriori human judgment have been mentioned previously 
in this report in terms of actual test results reported, and the most significant of these 
reported data are summarized in Table 2. 2/ Typically, however, these data reflect, in 
varying degrees, so small a sample of test cases, of user preferences, and/or of special 
purpose and interest, that no general extrapolation is reasonable. Moreover, the general 
questions of the "core" problems of evaluation in general again rear their own ugly heads. 



1/ Compare, for example, Kennedy, 1962 [311] and Needham, 1963 [433]. 
2/ See pp. 101-103 of this report. 

Thus, Borko and Bernick point out: 

"Up to this point we have used human classification as our criterion for the accuracy 
of automatic document classification. Against this criterion we have been able to 
predict with approximately 55% accuracy, and no more. Is this because our techniques 
of automatic classification are not very good, or is it because our criterion 
of human classification is not very reliable? There is some evidence to indicate that 
the reliability of human indexers is not very high. The reliability of classifying 
technical reports needs investigating and, perhaps even more basically, the reasons 
for using human classification as a criterion at all." 1/ 

In general, the results of automatic index-term assignment procedures appear to run 
in the area of 45-75 percent agreement with prior human indexing, 2/ and this in turn is well 
within range of, and often superior to, estimates of human inter-indexer consistency based 
on actual observations and tests. There can be little or no doubt that the results of automatic 
assignment indexing experiments to date (if extrapolation from the small and often 
highly specialized samples so far used in actual tests is in fact warranted 3/) do suggest 
that an indexing quality generally comparable to that achievable by run-of-the-mill manual 
operations, at comparable costs and with increased timeliness, can be achieved by machine. 
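
The comparison underlying these figures can be made concrete with a short Python sketch. The reports summarized here do not all state how "agreement" or "consistency" was computed, so both measures below are assumptions: one takes the fraction of the human-assigned terms that the machine also assigned, the other divides the terms two indexers share by all the distinct terms either of them used. The term sets are invented for illustration.

```python
def agreement_with_reference(machine_terms, human_terms):
    """Fraction of the human-assigned terms that the machine also assigned."""
    human = set(human_terms)
    if not human:
        return 0.0
    return len(human & set(machine_terms)) / len(human)


def consistency(terms_a, terms_b):
    """Terms in common divided by all distinct terms used by either indexer."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical index terms assigned to one document by three sources.
HUMAN_1 = {"computers", "indexing", "linguistics", "translation"}
HUMAN_2 = {"computers", "indexing", "documentation"}
MACHINE = {"computers", "indexing", "translation", "storage"}

print(agreement_with_reference(MACHINE, HUMAN_1))
print(consistency(HUMAN_1, HUMAN_2))
print(consistency(MACHINE, HUMAN_1))
```

In this invented case the machine matches the first indexer more closely than the second indexer does. That is the pattern the single criterion proposed earlier looks for: whether evaluators agree with the machine product as much as they agree among themselves.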

The question which remains is simply that of practicality, today. Extrapolation 
from small samples is highly dangerous, as is well noted even by enthusiasts for machine 
techniques. For at least some systems, the limitations on the number of clue words that 
can be handled (due in part to computational requirements, matrix manipulations, and the 
like) are such that, even in an experimental situation, certain "tests" are excluded from 
the result statistics because the items contained an insufficient number of clues. This 
fact is a serious indictment of reasonable extrapolations for these techniques today. Most 
tests so far reported have involved not only a highly specialized "sample" library or 
collection, but a severe limitation on the total number of "descriptors", subject headings, 
or classification categories to be assigned. Maron used 32, Borko 21, Williams 20, SADSACT 
70, Swanson 24. How would any of these approaches fare, given several hundred, much less 



1/ Borko and Bernick, 1963 [78], pp. 31-32. 

2/ See Table 2. 

3/ This is an important, perhaps crucial, caveat. See, for example, Goldwyn, 1963 
[233], p. 321: "In the micro-experiments of many of those who would apply statistical 
techniques ... The document collection consists of 0-100 units. Results based 
on the manipulation, real or imagined, of such a collection can be valid for it, yet 
become shaky or even nonapplicable to larger collections"; Perry, 1958 [471], p. 415: 
"A degree of selectivity quite acceptable for files of moderate size may prove quite 
inadequate in dealing with large files. This fact often makes it necessary to exert 
unusual care and considerable reserve in evaluating the results of small-scale tests 
and demonstrations which may tend to cause the mass effects of large files to be 
underestimated or overlooked completely"; Swanson, 1962 [586], p. 288: "The 
extent to which semantic characteristics of natural language are susceptible to being 
generalized from small sample data is deceptive." 
several thousand, possible indexing or classificatory labels? 1/ 

The use of very brief articles, or of abstracts, as the members of experimental 
corpora for investigations of automatic assignment indexing techniques presuming the 
processing of full text, either for indexing purposes or for subsequent "indexing-at-time-
of-search", is seriously misleading. First, it is not truly representative of discursive 
text, either in vocabulary, syntax, or stylistic variations involving synonymity, tropes, 
elisions, dangling referents, and innumerable other meaning-implications not explicitly 
stated. 

Secondly, as any author of a technical paper, for which he must provide an abstract, 
knows all too well, he must concentrate in the abstract on a telegraphic emphasis toward 
his principal topic and the points he wishes to make. He must omit most qualifying, 
specifying, and suggestive-of-other-leads-or-applications words and phrases, which he will 
in fact develop in the text itself. For this reason, even supposing that the author himself 
is unusually well-aware of the multiple points of access that many different potential users 
might desire, the required brevity of the abstract form almost necessarily demands terse, 
shorthand-type statements that can only increase the problems of "technese", of homography, 
and of single-subject representation. 

Granted, in either manual or machine-serviceable systems today, the current-awareness 
scanning need is largely met by indexing based solely or primarily on title only, or 
title-plus-abstract. But is this good enough for search and retrieval? If and only if it 
is, then automatic indexing potentialities available today should be considered for both 
purposes. 

Our final question as to whether automatic indexing can be accomplished by statistical 
means alone or must involve syntactic, semantic and pragmatic considerations is not 
entirely answerable. In terms of achieving comparable quality with many manually prepared 
indexes available today, statistical means alone do appear promising. But is the 
achievement of just this level (even if accompanied by significant gains in timeliness, 
coverage, and economy) really good enough? There are a number of serious investigators 



1/ For example, Black predicts (1963 [64], p. 19) that for most systems an adequate 
vocabulary or thesaurus will comprise some twenty thousand terms. See also 
Arthur D. Little, Inc., 1963 [23], p. 65: "The enormous number of computations 
required increases very rapidly with the number of indexing terms. Existing computers, 
operating serially, do not appear to be capable of handling the problem 
economically for collections with 9000 or more terms even if the simplest associative 
techniques are employed"; Williams, 1963 [642], p. 162: "One of the practical 
problems ... is in the inversion of large matrices. In certain methods the order of the 
matrix will equal the number of different word types in the population, which is 
usually in the thousands." 



convinced that it is not, 1/ and for this reason, research efforts are being directed toward 
these other considerations. 

On-going research and development work - whether in modified derivative indexing 
approaching a "concept-indexing" level; in automatic assignment indexing techniques as 
such; in automatic classification or categorization procedures, or in potentially related 
efforts directed toward automatic abstracting, automatic content analysis, and other 
aspects of linguistic data processing - is both reasonably extensive and quite promising. 
Most of the investigators who are seriously active in the field report their current 
objectives and recent accomplishments regularly to the National Science Foundation for 
publication in the series "Current Research and Development Efforts in Scientific 
Documentation." In the most recent issue, unfortunately current only as of November, 1962, 
there are not less than 25 reports of KWIC and similar title-permuted derivative indexing 
methods generated or proposed-to-be-generated by machine, there are several instances 
of investigations into various possibilities of modified derivative indexing to be 
accomplished by machine, and there are five to ten reports of active experimentation with 
various automatic assignment indexing schemes. These efforts and even more recently organized 
projects point in the hopeful direction that "KWIC indexes should be merely a sample of 
things to come". 2/ 

Assignment indexing techniques so far investigated can be, as we have seen, of two 
types which are quite distinct in terms of the principles involved. The first, which can be 
the more readily mechanized, involves the use of thesaurus-type lookup procedures covering 
the definable rules of "scope notes", "authority lists", or "see also" reference practice. 
The second type of assignment indexing, however, depends upon decision-making as 
to the propriety of assigning a particular indexing term to a particular document with 
reference to assignments to the collection as a whole (or a sample thereof). This latter 
type of assignment may be in terms of a priori categorizations of separable subsets of the 
collection. 

Alternatively, the bases for the latter type of assignment-indexing procedures may be 
derived from a posteriori determinations of the suitable subsets as in the factor analysis 
experiments of Borko, the latent class analysis approach of Baker, and the clustering- 
clumping approaches to automatic classification of Needham and others. It is to be noted 
in particular that Needham thinks an automatically generated categorization is preferable 
precisely because of lack of knowledge as to the exact attributes defining a class in 



1/ See, for example, Climenson et al, 1962 [133], p. 178: "The statistical approach 
attempts to use no more than the occurrences of word spellings and their relative 
distances in the document environment . . . [and] cannot provide the discrimination 
necessary for most indexing and abstracting applications"; Doyle, 1963 [162], p. 3: 
"Automatic indexing and abstracting, as currently conceived, do not require any sort 
of dictionary or other semantic reference, but only counting, comparing, and sorting - 
operations well known in numerical data processing. But success in applying such 
rules on a purely automatic basis can't help but be limited"; Borko, 1962 [75], p. 5: 
"Although difficult, identification [of different meanings carried by the same word, 
of the same meaning carried by different words] must be accomplished before the 
automatic categorization of document content can be truly effective. For the most 
part statistical methods, and even syntactic analysis, are inadequate for the job. A 
technique of textual analysis based upon the semantic properties of language is 
needed"; Grosch, 1959 [244], p. 20: "We need semantic methods . . . that will look for 
the intersection of redundant descriptors, each of which is at least slightly 
erroneous." 

2/ Doyle, 1962 [163], p. 381. 
existing classification schemes. However, in the related field of pattern recognition Uhr 
and Vossler have shown promising results both for criterial feature analysis (a priori 
assumption as to attributes or properties governing membership in specified classes) and 
for randomly generated discrimination operators which, applied in a [word illegible] manner, 
are increasingly adaptive to the detection of class-membership (Uhr and Vossler, 1961 
[615]). 

One particular way of looking at the problems of automatic indexing results, in 
effect, in placing these problems within the broader field of pattern perception and pattern 
recognition. We suggest that this is in fact a particularly fruitful approach. Certainly 
there is a wide area of potential commonality, and many promising leads for further 
research in automatic categorization can be found in the general pattern recognition 
literature, especially in work on randomly generated operators and on the problems of 
determination of membership in classes. 1/ Conversely, automatic classification techniques 
originally conceived as applicable to the handling of documentary information have in fact 
been applied quite successfully to at least one case of groupings of physical objects on the 
bases of machine-detectable common properties. 

The question of determination of membership-in-classes is basic to the problems of 
automatic classification and categorization. Thus the techniques for discriminating the 
statistically significant associations between "properties” of objects or items that are to 
be grouped into classes or categories, even when such "properties" are not known in 
advance and have no a priori identification, point to an increasing and promising conver- 
gence of research in pattern recognition, propaganda analysis and psycholinguistics, math- 
ematics and statistics, studies of linear threshold devices, and the like, as well as in the 
linguistic data processing field as such. 

It is true that such synthesized "classes" may have no convenient "names" or 
linguistic interpretations which make much sense to the individual human searcher or user. 
Nevertheless, what is suggested is that a radical departure from conventional habits of 
literature search and retrieval may be desirable from the standpoint of effective use of 
machine potentialities. This might mean that, ab initio, the customer would pose to the 
system a search query request not couched in his notion of words or terms actually used 
in the system, but either (a) an outline or statement of his own research proposal and 
plan of attack or (b) an indication of one or several items that he has already decided are 
pertinent to his interests, with a request for "more like these". 
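
The "more like these" style of request can be illustrated with a minimal sketch. It is not a method described in this report: it assumes a simple bag-of-words representation and cosine similarity as the matching rule, and the names `term_vector`, `cosine`, and `more_like_these` are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Bag-of-words vector: lowercase word -> count (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def more_like_these(examples, collection, top_n=3):
    """Rank items of `collection` (id -> text) against the pooled example texts.

    The searcher supplies items already judged pertinent instead of
    choosing index terms; items with zero similarity are dropped.
    """
    query = Counter()
    for text in examples:
        query.update(term_vector(text))
    scored = [(cosine(query, term_vector(text)), doc_id)
              for doc_id, text in collection.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]


collection = {
    "d1": "automatic indexing of documents",
    "d2": "cooking with garlic",
    "d3": "automatic classification of documents by machine",
}
print(more_like_these(["machine indexing of documents"], collection))
```

The same function would serve alternative (a) in the text if the searcher's statement of his research proposal were passed as the single example.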

An equally radical departure from conventional present habits and thinking is already 
implicit in Needham's suggestion of an automatically derived classification system and 
manual assignments thereto. 2/ It would attack present-day machine capacity and processing 
time limitations such that property and class or category associations must be held to 
something less than 1,000 x 1,000, unless prohibitive processing costs are to be incurred. 
This approach would assume a one-time large-scale building of vocabulary and term or 
category associations and derivation of assignment algorithms, and the printing out of the 
results in multiple copies for use by low-level clerical personnel carrying out, indeed, 
"machine-like" indexing. 

A final promising approach to the future prospects for fully automatic indexing and 
categorization is the perseverance in research and development efforts in advance of the 



1/ See, for example, Sebestyen, 1961 [539], 1962 [538]. 
2/ Needham, 1963 [432], p. 1. 
advent of versatile character readers and inexpensive, very large capacity, rapid direct 
access memories. These efforts will include not only further systematic exploration of 
syntactic, semantic and pragmatic considerations in linguistic data processing, but also 
further attacks on the problems of language and meaning themselves. Thus, we may 
conclude with Maron that: "automatic indexing represents the opening wedge in a general attack 
at not only the problems of identification search and retrieval, but also the problem of 
automatically transforming information on the basis of its content." 1/ 

If we are to attempt to solve this problem, as indeed we should, must we not look 
forward to the possibilities of rapid up-dating, thesaurus growth and revision, and quick 
and economical re-indexings of entire collections that only machine-processing capabilities 
can promise today? 
The term mechanized indexing can be interpreted in two different ways: as involving 
the use of machines to produce indexes once the index entries have been pre-determined 
manually, or as involving the use of machines to select the index entries as well as to 
prepare the indexes. 

The first interpretation, that of machine compilation of indexes is perhaps best 
represented by the progressively more sophisticated mechanization used for the production 
of Index Medicus from manual "shingling", through sequential card camera operations, to 
the computer-based system using a high-speed phototypesetter, the Photon GRACE 1, 2/. 
As noted elsewhere in this report, machine capabilities have made practical the preparation 
of citation indexes. In general, however, machine-compiled indexes work with the 
results of human intellectual efforts as applied in the subject content analysis of documents. 
We also find machines used to provide aids to the indexer. Two different tools may be 
employed to improve the quality of indexing. There are prescriptive aids in the sense of 
limiting and rigorously defining the scope of index terms to be used, and there are 
suggestive aids in the sense of provoking ideas about additional terms that might be used. 

The first type may involve a mechanized authority list or thesaurus used to normalize 
proposed index term entries, as has been demonstrated by Schultz 3/ and Schultz and 
Shepherd 4/ from 1960 onward. The potential value of this technique is indicated by further 
investigations of Schultz et al 5/ in which it was found that index terms proposed by authors 
agreed more with terms employed by more than one member of a typical user group than 
did terms available in the document titles. Another example of developments in the use of 
a mechanized thesaurus is the system at Lockheed Missiles and Space Division, Palo 
Alto 6/. 

This type of tool is used to check proposed indexing terms against the terms of the 
system vocabulary, to prescribe choices between synonyms and different levels of 
specificity, and to supply syndetic devices such as "see also" references. Computer 
manipulations of thesauri can also be used to diversify search questions and to provide 
useful groupings of terms previously used in the system. The mechanized thesaurus can 
thus serve as the second type of aid by suggesting to the human indexer additional terms he 
might use. In effect, such a thesaurus provides a display of prior term-term, document-term 
and document-document associations observed in a particular collection, such as was 
demonstrated in the form of special purpose equipment in Taube's "EDIAC" 7/ and the 
"ACORN" devices at A. D. Little 8/. 

The associational thesaurus can also be used to aid in the resolution of ambiguities of 
natural language and to provide for updating in the light of changing terminologies or 
changes in the subject scope of a collection. What are the prospects for automatic updating 
and revision of a mechanized thesaurus? Luhn 9/ has suggested that a record of the num- 
ber of times words and groups are looked up would be "an indispensable part of the system 
for making periodic adjustments based on the usage of words or notions as mechanically 
established. " 
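
The prescriptive and suggestive uses of a mechanized thesaurus described above, together with the lookup record Luhn proposes, can be sketched as follows. This is only an illustration under stated assumptions: the tiny vocabulary in `PREFERRED` and `SEE_ALSO` is invented for the example, and `check_terms` and `lookup_counts` are hypothetical names, not features of any system cited here.

```python
from collections import Counter

# Hypothetical authority list: entry term -> preferred system term.
PREFERRED = {
    "automatic indexing": "automatic indexing",
    "machine indexing": "automatic indexing",
    "mechanized indexing": "automatic indexing",
    "kwic": "permuted title indexing",
    "permuted title indexing": "permuted title indexing",
}

# Hypothetical "see also" references between preferred terms.
SEE_ALSO = {
    "automatic indexing": ["automatic classification"],
    "permuted title indexing": ["automatic indexing"],
}

# Record of how often each entry is looked up (after Luhn's suggestion).
lookup_counts = Counter()


def check_terms(proposed):
    """Return (accepted, rejected, suggested) for an indexer's proposed terms."""
    accepted, rejected, suggested = [], [], []
    for term in proposed:
        key = " ".join(term.lower().split())  # normalize case and spacing
        lookup_counts[key] += 1
        preferred = PREFERRED.get(key)
        if preferred is None:
            rejected.append(term)  # not in the system vocabulary
            continue
        if preferred not in accepted:
            accepted.append(preferred)  # synonym reduced to preferred term
        for related in SEE_ALSO.get(preferred, []):
            if related not in suggested:
                suggested.append(related)
    # Do not suggest terms the indexer has already been given.
    suggested = [s for s in suggested if s not in accepted]
    return accepted, rejected, suggested


result = check_terms(["Machine  indexing", "KWIC", "astrology"])
print(result)
```

The counts accumulated in `lookup_counts` are the kind of usage record that could later drive periodic adjustment of the vocabulary.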

Another suggestion for the development of mechanized aids in human indexing proce- 
dures has been made by Markus 10/. This is to "explore the possibility of applying 
programmed teaching to indexing, with or without machines. " 

Machine-compiled indexes rest upon the efficacy of human indexing and there is 
increasing reason to doubt that this will be "good enough" for the future. It appears that 
there is a growing consensus with respect to inadequacies of present scope and coverage 
of indexing services. Cheydleur 11/ emphasizes that: "The cost of manual classification 
and abstracting of all the articles in the world's hundred-thousand technical periodicals 
would be fantastic. The practicality of carrying it out in a coordinated and timely way by 
manual methods is unrealizable. There is also a pressing need to extend the coverage of a 
myriad of unpublished working papers. Hence, there is an utter necessity for automatic 
indexing, abstracting, and summarization by electronic data processors." 

Secondly, little confidence can be attached to routine, manual operations to produce 
subject-content selection indicia for subsequent selection and retrieval of stored 
documentary items for the following reasons: 

1. Wide variations of intra- and inter-analyst consistencies occur in the 
assignment of content-indicia, even with respect to well-established client-
interests and index term vocabularies. 

2. Potential clients may or may not be inclined to use the system, regardless of 
whether or not it provides efficient content-indicator-clue and selection 
criteria mechanisms. 

3. Future queries cannot, in general, be effectively predicted in advance, except 
for the cases of specific author or title retrieval requests. 

The problem of intra-indexer and inter-indexer inconsistency is of special interest 
because the degree of inconsistency will seriously affect search and retrieval effectiveness 
and because serious questions are raised with respect to the evaluation of any indexing 
system in terms of prior or independent human indexing. 

With respect to the effect of indexer inconsistency upon subsequent search 
effectiveness, O'Connor 12/ considers the possibilities of overassignment (i.e., the 
assignment of indexing terms to an item that a subsequent searcher would not consider pertinent 
to that item) in the case where a search is specified by index terms A, B and C, each term 
is over-assigned with ratio 1.0, and assignments and overassignments by the recognition 
rules are statistically independent: "Then only one eighth of the papers selected by the 
conjunction of A, B and C would correctly have all three terms." 

The complementary disadvantage of missing relevant references on search, because 
of indexer failure to supply all the appropriate indexing terms that a searcher would have 
considered relevant to a particular document would imply that, for a three-term query, 
assuming independence of term-assignments and a consistency level of 50 percent, only 
12.5 percent of the documents that the searcher would consider relevant would be retrieved 
if someone else had indexed these items. 
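
The arithmetic behind the two figures above (one eighth, and 12.5 percent) can be checked with a short sketch. It assumes, as the text does, statistical independence between terms, and it reads "over-assigned with ratio 1.0" as one over-assignment for every correct assignment; that reading is an interpretation, and the function names are hypothetical.

```python
def correct_fraction_with_overassignment(ratio, n_terms):
    """Fraction of items selected by an n-term conjunction that correctly
    carry all n terms, given `ratio` over-assignments per correct
    assignment for each term and independence between terms."""
    per_term_correct = 1.0 / (1.0 + ratio)  # e.g. ratio 1.0 -> one half
    return per_term_correct ** n_terms


def retrievable_fraction(consistency, n_terms):
    """Fraction of relevant items retrieved by an n-term query when each
    term is independently supplied by the indexer with probability
    `consistency`."""
    return consistency ** n_terms


print(correct_fraction_with_overassignment(1.0, 3))  # one eighth
print(retrievable_fraction(0.5, 3))                  # 12.5 percent
```

Both results fall off geometrically with the number of query terms, which is why even moderate per-term inconsistency is costly for multi-term searches.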

We have previously reported 13/ on the results of 700 simulated 3-term searches 
based upon both manual and machine indexing of approximately 20 items with respect to a 
fixed vocabulary of less than 100 allowed descriptors. These results show that if indexer 
A assigns to a given document the term "A" as indicative of subject content, then his 
subsequent chances of retrieving that document with a query for term "A" are 58.4 percent if 
the item had been indexed by someone other than himself, and 55.8 percent if indexed by an 
automatic indexing procedure developed at NBS, called "SADSACT" (Self-Assigned 
Descriptors from Self And Cited Titles) 14/. For three-term searches, any one searcher 
would be able to retrieve 26.4 percent of the items he would consider relevant to his query 
if they had been indexed by any of the other user-indexers, and 24.7 percent if the items 
had been indexed by the machine technique. 

Tinker 15/ provides evidence on the relationships between inter-indexer inconsistency 
and retrieval efficiency, assuming that a given indexer is a potential querist, with average 
chances of retrieval ranging from 6.5 to 36 percent. Additional evidence on the generally 
unsatisfactory state of manual indexing consistency has been reported as follows: 
1. Korotkin and Oliver 16/ report that five psychologists and five 
non-psychologists indexed 30 items with three descriptors per item. The task 
was repeated two weeks later with the aid of an alphabetized list of 
"suggested" descriptors derived from the data acquired in the first session. 

Mean percent consistency results were as follows: 

                                 Session I    Session II 
Group A (Psychologists)            39.0%        53.0% 
Group B (Non-psychologists)        36.4%        54.0% 

2. Evaluations of relevancy of selected items to a given search request have 
been explored by Badger and Goffman 17/ as follows: "Each of three 
evaluators was asked to dissect the output into relevant and non-relevant 
subsets. . . A chi-square test was applied to the observed evaluation as 
compared to those expected if the three evaluators were in complete 
agreement. The chi-square test of 81.57 was very significant, indicating that there 
was an absence of agreement." 

3. Greer 18/ reports on investigations of the interpersonal agreements between 
subjects asked to list the search words they would use in posing queries in 
the field of information storage and retrieval systems. He found "a mean 
percentage consistency agreement of 26.1 among subjects in stating search 
words." 

4. Hammond 19/ provides a sampling of the use by NASA (National Aeronautics 
and Space Administration) and DDC (Defense Documentation Center) of a 
common set of indexing terms to index an identical set of 996 technical 
reports. In considering 3-term searches against the variant indexing shown 
in Hammond's tables, sample calculations show a 25-30 percent failure to 
retrieve potentially relevant items. 

5. In terms of intra-indexer consistency, Rodgers 20/ reports that: "A 
consistency of .59 in selecting words to be indexed on two different occasions 
is not sufficiently high to give us great confidence in expecting a stable store 
when human indexers are used." 

For these reasons, increasing consideration should be given to the second 
interpretation of the term "mechanized indexing", that is, to machine generation of index 
entries, or automatic indexing. This typically involves machine processing of some natural 
language text, with severe problems of input. The first of several solutions involves use of 
automatic character recognition techniques to convert printed text to machine-usable form. 
This approach holds considerable future promise, but there are many current limitations 
and difficulties. 

A second possible solution, manual keyboard operations to produce a machine-useful 
transcription of a text, is plagued by high costs (i.e., at least $0.01 per word for 
unverified keypunching), and also by limitations of available time or manpower. 

A third alternative is suggested by current developments in computerized typesetting 
or tape-controlled casting or photocomposition machines. However, while such techniques 
promise major improvements for the automatic indexing of textual information to be pub- 
lished in the future, little can be done for already available literature, even with respect to 
the bibliographic citation information alone. Today's difficulties are emphasized by 
estimates of a cost of 30 million dollars to convert the present Library of Congress catalog 
to machine-readable form 21/. 






Assuming, however, that the input processing problems have been solved, we may ask 
what machines can do with respect to words in texts, or in portions of texts, that are 
available in machine-useful form. The machines can "read" the words for purposes of shifting 
and sorting and can copy or reproduce the words in some desired order, as in a 
machine-prepared concordance. Machines can match input words with words already in store and 
thus exclude input words from further machine consideration (as by stoplists in KWIC 
(Keyword-in-Context) and other forms of derivative indexing) or stress certain input words 
with reference to a selective "inclusion" dictionary. 
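
As an illustration of the shifting, sorting, and stoplist matching just described, the following is a minimal sketch of KWIC entry generation for a single title. The stoplist contents, the 30-character context width, and the names `kwic_entries` and `format_kwic` are assumptions made for the example; production KWIC programs differ in layout and wrap-around conventions.

```python
# Hypothetical stoplist of non-substantive words excluded as index entries.
STOPLIST = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def kwic_entries(title, stoplist=STOPLIST):
    """Return (keyword, left context, word, right context) tuples,
    one per non-stoplisted title word, sorted by keyword."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        keyword = word.strip(".,;:()\"'").lower()
        if not keyword or keyword in stoplist:
            continue  # stoplisted words never become index entries
        left = " ".join(words[:i])
        right = " ".join(words[i + 1:])
        entries.append((keyword, left, word, right))
    return sorted(entries, key=lambda entry: entry[0])


def format_kwic(entries, width=30):
    """Align each keyword in a fixed column with truncated context."""
    return [f"{left[-width:]:>{width}}  {word} {right[:width]}"
            for _, left, word, right in entries]


entries = kwic_entries("The Automatic Indexing of Documents")
for line in format_kwic(entries):
    print(line)
```

An "inclusion" dictionary would simply invert the test, keeping a word only if it appears in the selected list.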

Next, machines can tabulate and count, so that both absolute and relative word 
frequency data may be applied to either indexing or search-selection algorithms. 
Measurements of sequential distances between selected words in the input text may also be 
applied. Machine look-ups against a master vocabulary can provide automatic supplying of 
syndetic information, synonym reduction, lexical normalization, generic-specific subsumption, 
and data with respect to previously observed word-word or word-subject co-occurrences. In 
addition, information can be provided as to the possible syntactic roles of input words. 
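
The counting and distance measurements mentioned above can be sketched in a few lines. The tokenization rule and the function names (`tokenize`, `frequencies`, `min_distance`) are assumptions for illustration; how such figures would then feed an indexing or search-selection algorithm is left open here, as it is in the text.

```python
from collections import Counter

PUNCTUATION = ".,;:()\"'"


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def frequencies(tokens):
    """Map each word to (absolute count, relative frequency)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (n, n / total) for word, n in counts.items()}


def min_distance(tokens, first, second):
    """Smallest number of word positions separating an occurrence of
    `first` from an occurrence of `second`; None if either is absent."""
    positions_a = [i for i, word in enumerate(tokens) if word == first]
    positions_b = [i for i, word in enumerate(tokens) if word == second]
    if not positions_a or not positions_b:
        return None
    return min(abs(i - j) for i in positions_a for j in positions_b)


tokens = tokenize("Indexing by machine: machine indexing of text.")
print(frequencies(tokens)["machine"])
print(min_distance(tokens, "machine", "text"))
```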

In the light of such machine capabilities, what can be said of the present state of the 
art in automatic indexing? Automatic indexing in the sense of machine-prepared indexes 
that are generated by the automatic extraction and manipulation of keywords, especially 
from titles, is of course widely used in KWIC indexes such as Chemical Titles and many 
others both in the United States and elsewhere. 

Fischer 22/ provides a retrospective view of KWIC indexing concepts, including 
variants like KWOC (Keyword out of Context) and WADEX (Words and Authors Index to 
Applied Mechanics Review). She stresses the potentialities of linking such extraction 
indexing to selective dissemination systems and concludes: "Plans for using the 'Echo' 
satellites to link information centers around the world, in a world wide drive toward 
immediacy in information dispersion, will surely provide a place for KWIC indexes and for 
the KWIC concept." Warheit 23/ also reports that consideration is being given to combining 
selective dissemination systems and KWIC. Fundamental questions remain: How useful 
and how much used are KWIC and other machine-generated indexes based upon the 
extraction of words from a limited portion of the author's own text? 

These questions relate to an important distinction between two quite different types of 
indexing. The distinction is that whereas "derived" indexing takes as index entries the 
author's own words in the title, the abstract or the full text, in "assignment" indexing an 
index term, descriptor, subject heading, or classification code is assigned to a document 
as an indicator of content and the term assigned does not need to be identical with any of 
the author’s own words. 

We can report continuing progress in use of derivative indexing techniques such as 
KWIC, and also in experiments with automatic assignment indexing and automatic subject 
classification. Timeliness of index production is certainly one of the major virtues of 
KWIC. A similar timeliness is promised for automatic assignment indexing techniques 
provided that requirements can be kept sufficiently low with respect both to keystroking 
and computer processing. 

Intermediate results may be achieved by pre-editing, normalization, and post-editing 
techniques. Manual pre-editing to modify and supplement keywords in title, abstract, or 
portions of text has been used in permuted title and KWIC-type indexing from the punched 
card system that began operation in 1952 24/ to the "notation-of-content" system developed 
for NASA 25/. Kreithen 26/ suggests a combination of derivative and assignment indexing, 
as follows: 
"The combination of these two automatic indexing methods, whereby a number of 
indexing terms would be assigned to a document on the basis of its category 
dependency; and the rest extracted from text, might be a desirable solution." 

Automatic assignment indexing, with clue-words in the input textual material used to 
determine the proper assignments of indexing terms to incoming items, is generally 
equivalent to automatic classification techniques that assign a single classification 
category to items, again on the basis of clue-words in the input text. This is because a 
minimum cut-off level in the automatic assignment procedure, combined with a sufficiently 
generic vocabulary, can achieve classificatory as well as indexing results. The present 
state of the art in automatic assignment indexing and classification is marked by intriguing 
demonstrations of technical feasibility for the relatively small samples so far investigated. 
Present difficulties associated with automatic assignment indexing or classification 
techniques, however, relate to problems of input processing requirements, computational 
limitations, the special purpose nature of results demonstrated to date, and problems of 
evaluation. 

A listing of automatic classification and assignment indexing experiments as of 1964 
is provided in Table 2, pp. 101-103, of the text of this report. To this we should add more 
recent results of our own as well as additional results reported by O'Connor 27/ and 
Williams 28, 29/, Dale and Dale 30, 31/, and others. 

In the SADSACT method, we start with a "teaching sample" of items representative of 
our collection, to which indexing terms have previously been assigned. We then derive the 
statistics of co-occurrences of substantive words in the titles and abstracts of these items 
with descriptors assigned to them, ending with a vocabulary of clue words weighted with 
respect to prior co-occurrences with various descriptors with which they have been 
associated. 

Then, for new items, we look up each word of input (typically consisting of 100 words 
or less: title and up to 10 cited titles, or title and brief abstract, or title and first or last 
paragraphs) and derive "descriptor-selection-scores" based upon the prior ad hoc 
word-descriptor associations. The highest ranking descriptors, in terms of the accumulated 
selection scores, are then assigned, at some appropriate cut-off level, to the new item. 
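
The two stages just described, deriving weighted clue words from a teaching sample and then accumulating descriptor selection scores for a new item, can be sketched as follows. This is not the SADSACT program itself: the report does not give the weighting formula, so the sketch assumes that a clue word's weight for a descriptor is the fraction of teaching items containing the word that carry that descriptor. The stoplist, the sample data, and the names `train` and `assign` are likewise hypothetical.

```python
from collections import defaultdict

STOPLIST = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to"}


def substantive_words(text):
    """Lowercase words, punctuation stripped, stoplisted words removed."""
    stripped = (word.strip(".,;:()\"'").lower() for word in text.split())
    return [word for word in stripped if word and word not in STOPLIST]


def train(teaching_sample):
    """teaching_sample: list of (text, [descriptors]) pairs.
    Returns clue word -> {descriptor: weight}; the weight is an assumed
    co-occurrence fraction, not the formula used at NBS."""
    cooccurrence = defaultdict(lambda: defaultdict(int))
    items_with_word = defaultdict(int)
    for text, descriptors in teaching_sample:
        for word in set(substantive_words(text)):
            items_with_word[word] += 1
            for descriptor in descriptors:
                cooccurrence[word][descriptor] += 1
    return {word: {d: n / items_with_word[word] for d, n in by_desc.items()}
            for word, by_desc in cooccurrence.items()}


def assign(text, weights, cutoff=1):
    """Accumulate selection scores over the input words and return the
    `cutoff` highest-ranking (descriptor, score) pairs."""
    scores = defaultdict(float)
    for word in substantive_words(text):
        for descriptor, weight in weights.get(word, {}).items():
            scores[descriptor] += weight
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:cutoff]


sample = [
    ("Automatic indexing of documents", ["INDEXING"]),
    ("Character recognition by machine", ["PATTERN RECOGNITION"]),
    ("Indexing by machine", ["INDEXING"]),
]
weights = train(sample)
print(assign("Machine indexing of reports", weights))
```

Raising `cutoff` yields several descriptors per item (indexing); a cutoff of one with a generic vocabulary behaves like single-category classification.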

To date, machine first-choice assignments (corresponding to performance figures 
reported for other automatic classification and indexing experiments) have been checked 
for 213 test items either against prior DDC indexing or against user evaluations, or both, 
with 72.3 percent mean overall agreement. 

Our most recent results involved 150 test items. Machine assignments of descriptors 
to items were checked by having up to five actual users of our collection rate the relevance 
to a given one of 14 descriptors of items whose titles were listed under that descriptor by 
the machine assignment procedure. A total of 451 pairings of user-relevance-ratings with 
the machine has now been analyzed, with a mean relevance rating of 74.9 percent. With 
respect to machine first-choices, there were 206 pairings with 85.4 percent of the machine 
assignments rated as at least somewhat relevant. 

Checks have also been made of SADSACT results as compared to which of these same 
documents would be directly retrievable if a KWIC or some other title-only index were to 
be used. For the first 50 machine assignments rated as "highly relevant" in user- 
evaluations, a check was made to determine whether or not the same item would be 
retrievable by lookup under the name of the descriptor in a KWIC index. There were 9 
such cases, or 18 percent. In 48 percent of the cases, a part of the descriptor name 
occurred in the document title. For 17 cases, or 34 percent, there were no title words 
identical with any part of the descriptor name. 

One evaluator was also asked to review the titles of 150 test items and to indicate 
which, if any, he would wish to retrieve under each of 14 descriptors. He requested in all 
353 items and 209 of these were retrieved on the basis of the SADSACT assignments, for a 
recall ratio of 59.2 percent. Of these, 167 had been previously evaluated by the same user 
for an overall relevance ratio of 81.4 percent. 
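
For readers less familiar with the two measures, the following short sketch shows how a recall ratio and a relevance ratio are computed, and checks the recall figure against the counts reported above. The function names and the toy item sets are assumptions made for illustration; only the counts 209 and 353 come from the text.

```python
def recall_ratio(retrieved, wanted):
    """Percentage of the wanted items that were actually retrieved."""
    return 100.0 * len(retrieved & wanted) / len(wanted)


def relevance_ratio(retrieved, relevant):
    """Percentage of the retrieved items that were judged relevant."""
    return 100.0 * len(retrieved & relevant) / len(retrieved)


# Counts reported in the text: 209 of the 353 requested items were retrieved.
reported_recall = round(100.0 * 209 / 353, 1)

# Toy example with invented item numbers.
retrieved = {1, 2, 3, 4}
wanted = {2, 3, 5}
toy_recall = recall_ratio(retrieved, wanted)        # 2 of 3 wanted items found
toy_relevance = relevance_ratio(retrieved, wanted)  # 2 of 4 retrieved items wanted
```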

Summary accounts of automatic classification and assignment indexing experiments 
have been provided by Schultz 32/ in the form of an "imaginary panel discussion" (in which, 
hypothetically, Borko, Schultz, and Stevens discuss their respective systems), and by 
Black 33/ who concludes: "Provided that overall effectiveness is nearly equal, the system 
that depends less on the human element would clearly seem to be more desirable from a 
standpoint of reliability and efficiency, and perhaps even from a standpoint of economics 
as well." 

Additional work has been reported by Dale and Dale 30, 31/, Damerau 34/, Dolby et 
al 35/, Kreithen 26/, O'Connor 27/, and Williams 28, 29/, among others. Borko's 36, 37/ 
more recent papers on this subject consider problems of reliability and evaluation. He 
reports comparisons of automatic and manual classifications of 997 psychological abstracts 
into 11 categories, factor-analytically derived from 65 percent of these abstracts used as 
source items. He concluded that it was possible to determine that the percentage of agree- 
ment between automatic classification and perfectly reliable human classification could 
reach 67 percent. 

O'Connor's 1965 report 12/ provides further promising results of his "machine-like 
indexing by people" studies and also discussions of other techniques and of difficulties and 
limitations in automatic indexing experiments to date. Using Merck, Sharp and Dohme 
indexing data, O'Connor tested additional recognition-of-clue-word rules based on syntactic 
emphasis, a first sentence and first paragraph measure, a syntactic -distance measure, 
negations forbidden near clue words, and words naming substances or types of operations 
being required in close proximity to clue words. 

He reports considerable success with these new rules as follows: "The computer 
rules selected 92% of 180 toxicity papers. Allowing for sampling error, these rules would 
select between 88 and 95 percent of the toxicity papers. Thus the computer rules would be 
roughly comparable to, or perhaps superior to, MSD indexers in identifying toxicity 
papers." 
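
The allowance for sampling error in the passage quoted above can be illustrated with a standard confidence interval for a proportion. The sketch below uses the Wilson score interval at the conventional 95 percent level; this choice of method is an assumption, since the quoted report does not say how its range was obtained, and the count of 166 selected papers is likewise assumed (it is the whole number nearest to 92 percent of 180).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 166 of 180 toxicity papers selected (about 92 percent).
low, high = wilson_interval(166, 180)
```

Under these assumptions the interval runs from roughly 87 to 95 percent, which is close to the range given in the quotation.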

With respect to the difficulties to be observed in automatic indexing experimentation, 
O’Connor questions the adequacy of samplings of subject specifications, documents, and 
collections, the size of clue word vocabularies, and the human judgments used as stand- 
ards in many of the studies that have been made. 

The question of sampling adequacy in terms of the representativeness of clue word 
vocabularies as related to index terms or classification categories may be particularly 
critical for methods using small teaching samples. Spiegel and Bennett 38/ report that: 
"There seems to be no simple relation between the size of the corpus and the size of the 
vocabulary but after a certain point vocabulary size increases very slowly." 

Findings by Williams 29/ are encouraging. Working with teaching samples of 35, 70, 
and 140 items respectively, he reports that in the first 10,000 word tokens processed from 
the text of 2,700 abstracts 1,800 different word types were encountered but that in the 
80,000 to 90,000 range only 255 new types appeared. He found further that "an increase in 
sample size beyond 140 would not appear to offer any significant increase in classification 
performance." 
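
The kind of vocabulary-growth count reported here (new word types encountered in each successive block of word tokens) can be sketched as follows. The function name, the block size, and the toy token stream are assumptions for illustration; the figures quoted above refer to blocks of 10,000 tokens.

```python
def new_types_per_block(tokens, block_size):
    """Return, for each successive block of tokens, how many word types are new."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        new_types = {t for t in block if t not in seen}
        counts.append(len(new_types))
        seen |= new_types
    return counts


# Toy token stream (invented); real use would pass the running text of the abstracts.
tokens = "the cat sat on the mat the cat ran on the mat".split()
growth = new_types_per_block(tokens, 4)
```

A falling sequence of counts, as in the 1,800 versus 255 comparison above, indicates that the clue-word vocabulary is approaching saturation for the corpus in question.
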

Williams found an average correct classification of 62 percent for 474 test items 
automatically assigned to one of four solid state categories 28/. In other tests, 2,754 
solid state abstracts were classified into three primary and three secondary categories, 
using a computer program capable of handling up to 50 clue words, 10 subject categories, 
and any number of documents. Performance effectiveness ranged from 62 to 88 percent 
correct by comparison with the original classifications at the more generic level and from 
67 to 92 percent correct at the more specific level. 

Further progress in the application of statistical association, clumping and syntactic 
analysis techniques has also been reported. Statistical association techniques are 
concerned with correlations and coefficients of similarity assumed to exist between items 
or objects sharing common properties. In documentary item applications, document- 
document similarities are calculated for sharings of the same index terms or for common 
patterns of citing the same references, of being cited by the same other documents, and 
the like. Word-association techniques include the development of absolute or relative fre- 
quencies of co-occurrence in a given set of documents, such as those representative of a 
specific subject matter field. Various normalizing procedures can be used to remove 
effects of tendencies for certain words to occur frequently in general. Spiegel and asso- 
ciates 38/ at Mitre Corporation have explored means of normalization to eliminate effects 
of length of text strings, relative positions of words in a string, and vocabulary size. 
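
As an illustration of word-association statistics with a simple normalizing step, the sketch below counts document-level co-occurrences and divides each count by the geometric mean of the two words' document frequencies, so that words occurring frequently in general do not dominate. This particular coefficient is an assumption chosen for brevity; it is not the Mitre procedure, and the function name and toy documents are likewise invented.

```python
import math
from collections import Counter
from itertools import combinations


def association_coefficients(documents):
    """Map each co-occurring word pair to co(a, b) / sqrt(df(a) * df(b))."""
    doc_freq = Counter()   # number of documents containing each word
    pair_freq = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))  # pairs come out in sorted order
    return {
        pair: count / math.sqrt(doc_freq[pair[0]] * doc_freq[pair[1]])
        for pair, count in pair_freq.items()
    }


# Toy documents (invented for illustration).
docs = [
    "the thin film memory",
    "the magnetic core memory",
    "the thin film circuit",
]
coeff = association_coefficients(docs)
```

In this toy case "film" and "thin" always occur together and receive the maximum coefficient, while "film" and "the" co-occur just as often in absolute terms but score lower because "the" occurs in every document.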

Ernst 39/ reports that at Arthur D. Little: "We are . . . seeking to provide a working 
retrieval system which will incorporate associative features. The objective will be to 
make use of automatically computed index term associations as a basis for detecting and 
presenting an appropriate list of near-synonyms for the concepts desired by a user, 
essentially the automatic generation of a limited thesaurus in response to individual user 
requests." In Switzer's model 40/, co-occurrence statistics of index terms consisting of 
words from title or text, author's names, and words and author names from cited titles, 
are used. Significant probabilities for such co-occurrences are then derived. 

Methods that group objects or items in terms of co-occurrence data for their prop- 
erties or characteristics are involved in the "clumping" techniques as proposed at the 
Cambridge Language Research Unit. Further investigations into the development of the 
basic CLRU approach have been conducted at the Linguistic Research Center at the Univer- 
sity of Texas, by Dale and others 30, 41 /. In this work, simulation of associative doc- 
ument retrieval by computer gave results for 260 computer abstracts, using the same 90 
clue words as previously used by Borko: "The recall ratios in the test requests were high 
(i.e., very few relevant documents were not retrieved); relevance ratios were characteris- 
tically smaller (of the order of 10 percent). However, since the output lists are ordered, 
it is interesting to note that the relevance ratios are significantly much higher in the upper 
portions of the output lists (roughly between 25 percent and 50 percent in the upper fourth 
of the output lists), and that recall ratios are still of the order of 50-70 percent." 

In 1964 a report of the Astropower Laboratory 42 / outlined a "semantic space 
screening model" based on the assumptions that keywords or phrases have quantifiable 
'values', that by itemizing the keywords in a document sufficient information is obtained 
for its classification, and that by adding the values for the keywords in a document the 
pertinence of that document to a particular subject field can be determined. A training 
sample consisted of 120 abstracts drawn from six subfields of electrical engineering. 
Results showed successful classification of source items, using four different classifica- 
tion formulas, as ranging from 49 to 96.3 percent. Results with test items ranged from 
32.9 to 69.0 percent accuracy. 

The automatic indexing, selective dissemination and retrieval system design developed 
by Ossorio 43/ is based on a system vocabulary subsequently used for the automatic assign- 
ment of new items to appropriate locations in a pre-established "classification space". An 
"attribute space" may also be developed to identify the kind of information found in a doc- 
ument, e.g., that it deals with concepts such as weight or physical size rather than with 
mathematical or space and time concepts. 

Both types of "space" in this system are constructed through the use of factor analysis 
applied to previously established relationships between the terms in the system vocabulary 
(approximately 1,450 terms) and 49 subject fields and to relevance ratings of attributes 
with respect to items. Then, "documents are indexed by being assigned a set of coor- 
dinates in the classification space by means of the Classification Formula and the system 
vocabulary." 

With respect to the use of linguistic techniques in automatic indexing and classification, 
methods of computational linguistics may be used to derive measures of the probable 
significance of words in document texts. Damerau 34/ reports experimentation with word 
subset selection for indexing purposes based upon word occurrence frequencies signif- 
icantly larger than expected frequencies (following Edmundson and Wyllys, in part), with 
encouraging results. Findings by Black 33/, Simmons et al 44/, Spiegel and Bennett 38/, 
and Wallace 45/, among others, suggest the need for continuing investigations in the area 
of proper discrimination between significant clue words and non-informing words for a 
particular corpus or collection. Extensive computer processing and analyses such as 
Dennis 46/ has applied to the legal literature are needed for other subject matter fields. 

The latter investigator warns that neither raw word frequencies nor the numbers of doc- 
uments in which a word occurs provide good criteria for distinguishing between trivial or 
non-informing and significant or informing words. She suggests, instead, that "discrim- 
ination increases with the skewness of the word distribution in the file". 
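
The point can be illustrated by comparing two words that have the same total frequency in a file but are distributed differently across its documents. The sketch below uses the ordinary moment coefficient of skewness on per-document counts; the choice of that particular measure, the function name, and the toy counts are assumptions, since the skewness statistic actually used is not specified here.

```python
def skewness(counts):
    """Moment coefficient of skewness (third central moment / variance ** 1.5)."""
    n = len(counts)
    mean = sum(counts) / n
    m2 = sum((c - mean) ** 2 for c in counts) / n
    m3 = sum((c - mean) ** 3 for c in counts) / n
    if m2 == 0:
        return 0.0  # perfectly even distribution: no skew
    return m3 / m2 ** 1.5


# Occurrences of two hypothetical words in each of eight documents.
# Both words occur 16 times in all, so raw frequency cannot tell them apart.
evenly_spread = [2, 2, 2, 2, 2, 2, 2, 2]  # behaves like a function word
concentrated = [0, 0, 0, 0, 0, 0, 0, 16]  # behaves like a subject-bearing word

skew_even = skewness(evenly_spread)
skew_concentrated = skewness(concentrated)
```

On this measure the concentrated word has a strongly positive skewness while the evenly spread word has none, which is the sense in which skewness, unlike raw frequency, separates the two.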

Baxendale has suggested that certain types of phrase structures and nominal construc- 
tions, as determined by relatively unsophisticated machine syntactic analyses, are useful 
in revealing appropriate subject-content clues. A recent example is provided by Clarke 
and Wall 47/: "The hypothesis is that the importance of nominal constructions in selection 
of index unit candidates places emphasis on the bracketing of all noun phrases." 

Baxendale's continuing work 48/ further suggests that "through the methods of statistical 
decision theory it is hoped to formulate quantitative measures that will separate inform- 
ative index terms from noninformative." Continuing use of syntactic analysis principles is 
provided as an option in the SMART system (Salton 49/) and possibilities for choosing index 
terms automatically by syntactic criteria have been explored by Dolby et al 35/. 

Closely related to automatic classification or indexing experiments involving linguistic 
factors are document and word grouping investigations for homograph resolution and sub- 
ject field identification purposes, such as those of Doyle 50/ and Wallace 45/. Doyle used 
a Fortran computer program developed by Ward and Hook for iterative automatic groupings 
of 50 physics and 50 non-physics documents. He was able to show clear-cut separation of 
two meanings of words such as "force" and "satellite". 

A case involving overlaps of word memberships in more than one subject class has 
been investigated by Wallace 45/. Using word frequency data, he found 48 words in com- 
mon on the first 100 word-frequency rankings for psychological and computer literature 
abstracts, with function words predominating. However, using a word rank sum criterion, 
he was able to separate 50 psychological abstracts from 50 computer abstracts with 78 
percent success. 

We may thus conclude that the progress and prospects of automatic indexing, as of 
September 1966, are both provocative and challenging. They are "provocative" because so 
much in terms of both practical and theoretical accomplishment has already been dem- 
onstrated, and "challenging" because so much remains to be done. Further, what remains 
to be done will in all probability require serious, intensive, and imaginative investigations 
of a wide variety of questions from the relative usage and acceptability of a KWIC index 
through possible changes in author and editor practices to the fundamental questions of 
semantics and human judgment. 

Nevertheless, when the results of automatic classification or automatic indexing 
procedures reach levels of 70 percent or better mean agreement either with human in- 
dexers or with potential users evaluating the relevance of items retrieved by such indexing, 
then the machine methods should be preferred to routine, run-of-the-mill, manual indexing 
wherever the costs are at least commensurate. 

The technical feasibility of achieving such performance levels for a relatively small 
number of classification categories or a relatively small vocabulary of index terms has 
already been demonstrated experimentally. There remain unresolved questions of the 
extent to which it will be possible to apply such techniques to the larger vocabulary require- 
ments and the practical operating considerations in actual collections. 

Assuming that we can solve these problems, however, many advantages will accrue. 
First is the speed with which many items can be indexed: a few minutes or hours at 
most for, say, 10,000 items. Secondly, there are advantages of timeliness and the ease 
with which an entire collection can be re-indexed or re-classified. A third advantage is 
the consistency of the machine procedures, especially as compared with the inconsistency 
to be noted in available data on tests of comparative performance among indexers. 

The advantage of ability to re-index quickly, easily, and inexpensively (because most 
input costs will have been incurred previously) is of major importance in terms of over- 
coming present barriers to the introduction of improvements in operating systems (since, 
as Kyle 51/ points out, "The most common reason for not trying new and/or improved 
techniques of classification and indexing is the difficulty of reclassifying and re-indexing 
large collections") and in terms of dynamic revision and up-dating (as Borko 37/ 
emphasizes). 

Another advantage, particularly of methods using teaching samples, is (as suggested by 
Mooers as early as 1959 52/) the capability for making assignments of indexing terms in, 
say, an English language system to items whose texts are written in other languages: 
French, German, or Russian. This type of advantage can point the way to greater interna- 
tional collaboration in indexing and document control procedures. 

A further possibility is suggested by the convergence of automatic indexing techniques 
based upon teaching samples with adaptive selective dissemination systems and client feed- 
back possibilities, especially those involving "more-like-this!" requests. If we assume a 
large-scale, multiple-access system with adequate personalized files for the typical client, 
the common data bank of document identificatory and selection criteria, condensed rep- 
resentations, and full text (if available) can be selectively accessed by him on the basis of 
automatic indexing generated by his own choice of selection criteria and his own choice of 
exemplar items for each such criterion. 

He may provide a standing-order interest profile with respect to patterns of his own 
selection criteria, with weighting indications as to relative degrees of interest. Dynamic 
re-adjustments to standing requests and weightings can be made in accordance both with 
his responses to notifications and with any "more-like-this" requests received from him. 
System accounting and usage statistics can provide a feedback warning system as to the 
adequacy of his selection-criteria set and enable him to initiate re-processing of those 
documents in the collection likely to be of current interest to him. 

We must close, however, with a caveat: if machines have not yet mastered us, neither 
have we yet the requirements of the machine to the degree of advanced planning that will be 
required, especially for those information processing operations involving the analysis of 
content and not merely the manipulation of records: for here we are faced with the great 
challenges of human communication, human decision-making, and human problem-solving. 